[R] Reading Json data

Mark Sharp msharp at txbiomed.org
Mon Aug 10 17:46:19 CEST 2015


Mayukh,

I apologize for taking so long to get back to you about your problem. You may have found a solution by now; if so, I would be interested to see it. I have developed a hack to solve the problem, but I expect that someone who knew JSON handling or text parsing better could develop a more elegant solution.

As I understand the problem, your text file contains more than one JSON object in text form; there are three. The first two are very similar, and the last is a trailer indicating what was done, when it was done, and the number of JSON objects sent. The problem is that fromJSON() only reads the first of the JSON objects.
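
If each object sits on its own line in the original file (an assumption; the e-mail line wrapping makes that hard to confirm), a shorter route is to parse the file line by line. A minimal sketch, where read_json_lines() is a hypothetical helper name and jsonlite is assumed to be installed:

```r
# Hypothetical helper: parse a file that holds one JSON object per line.
# Assumes no object is split across lines and that jsonlite is installed.
read_json_lines <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(trimws(lines))]  # drop blank lines
  lapply(lines, jsonlite::fromJSON)      # one parsed object per line
}

# Usage (the path is illustrative):
# json_list <- read_json_lines("../data/json_file.txt")
# length(json_list)
```

This does not help when a single object spans several lines, which is the case the helper functions in this message are meant to handle.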

I have defined three helper functions to separate the JSON objects, read them in, and store them in a list.

library(RJSONIO)
library(stringi, quietly = TRUE)
#library(jsonlite) # also works

#' Returns dataframe with ordered locations of the matching braces.
#'
#' There is almost certainly a better function to do this. Note that braces
#' inside quoted string values are counted too and may be paired incorrectly.
#' @param txt character vector of length one having 0 or more matching braces.
#' @import stringi
#' @examples 
#' library(rmsutilityr)
#' match_braces("{123{456{78}9}10}")
#' @export
match_braces <- function(txt) {
  txt <- txt[1] # just in the case of having more than one element
  left <- stri_locate_all_regex(txt, "\\{")[[1]][ , 1]
  right <- stri_locate_all_regex(txt, "\\}")[[1]][ , 2]
  len <- length(left)
  braces <- data.frame(left = rep(0, len), right = rep(0, len))
  for (i in seq_along(right)) {
    for (j in rev(seq_along(left))) {
      if (left[j] < right[i] && left[j] != 0) {
        braces$left[i] <- left[j]
        braces$right[i] <- right[i]
        left[j] <- 0
        break
      }
    }
  }
  braces[order(braces$left), ]
}

#' Returns a list containing two objects in the text of a character vector 
#' of length one: (1) object = the first json object found and (2) remainder = 
#' the remaining text.
#' 
#'  Properly formed messages are assumed. Error checking is non-existent.
#' @param json_txt character vector of length one having one or more JSON
#' objects in character form.
#' @import stringi
#' @export
get_first_json_message <- function(json_txt) {
  len <- stri_length(json_txt)
  braces <- match_braces(json_txt)
  if (braces$right[1] + 1 > len) {
    remainder <- ""
  } else {
    remainder <- stri_trim_both(stri_sub(json_txt, braces$right[1] + 1))
  }
  list(object = stri_sub(json_txt, braces$left[1], to = braces$right[1]),
       remainder = remainder)
}

#' Returns list of lists made by call to fromJSON()
#'
#' @param json_txt character vector of length 1 having one or more
#' JSON objects in text form.
#' @import stringi
#' @export
get_json_list <- function (json_txt) {
  t_json_txt <- json_txt
  i <- 0
  json_list <- list()
  repeat{
    i <- i + 1
    message_remainder <- get_first_json_message(t_json_txt)
    json_list[i] <- list(fromJSON(message_remainder$object))
    if (message_remainder$remainder == "")
      break
    t_json_txt <- message_remainder$remainder
  }
  json_list
}

json_file <- "../data/json_file.txt"
json_txt <- stri_trim_both(stri_c(readLines(json_file), collapse = " "))
json_list <- get_json_list(json_txt)
length(json_list)
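
One limitation of the brace matching above is that a brace inside a quoted string value (tweet text, for instance) is counted like a structural brace. A minimal base R sketch of a string-aware alternative follows; split_json_objects() is a hypothetical name, and well-formed input is assumed:

```r
# Hypothetical helper: split text holding several top-level JSON objects
# into one string per object, ignoring braces inside quoted strings.
split_json_objects <- function(txt) {
  chars <- strsplit(txt, "")[[1]]
  depth <- 0L           # current brace nesting level
  in_string <- FALSE    # TRUE while inside a double-quoted string
  escaped <- FALSE      # TRUE if the previous character was a backslash
  start <- NA_integer_  # position of the current object's opening brace
  objects <- character(0)
  for (k in seq_along(chars)) {
    ch <- chars[k]
    if (in_string) {
      if (escaped) {
        escaped <- FALSE
      } else if (ch == "\\") {
        escaped <- TRUE
      } else if (ch == "\"") {
        in_string <- FALSE
      }
    } else if (ch == "\"") {
      in_string <- TRUE
    } else if (ch == "{") {
      if (depth == 0L) start <- k
      depth <- depth + 1L
    } else if (ch == "}") {
      depth <- depth - 1L
      if (depth == 0L) {
        objects <- c(objects, paste(chars[start:k], collapse = ""))
      }
    }
  }
  objects
}

# Braces inside the string values do not affect the split.
pieces <- split_json_objects('{"a":"x}y","b":{"c":1}} {"d":"{"}')
length(pieces)
```

Each element of the result could then be passed to fromJSON(), for example with lapply(split_json_objects(json_txt), fromJSON).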


R. Mark Sharp, Ph.D.
Director of Primate Records Database
Southwest National Primate Research Center
Texas Biomedical Research Institute
P.O. Box 760549
San Antonio, TX 78245-0549
Telephone: (210)258-9476
e-mail: msharp at TxBiomed.org







> On Jul 27, 2015, at 5:16 PM, Mark Sharp <msharp at TxBiomed.org> wrote:
> 
> Mayukh,
> 
> I think you are missing an argument to paste() and a right parenthesis character.
> 
> Try 
> json_data <- fromJSON(paste(readLines(json_file), collapse = " "))
> 
> Mark
> R. Mark Sharp, Ph.D.
> msharp at TxBiomed.org
> 
> 
> 
> 
> 
>> On Jul 27, 2015, at 3:41 PM, Mayukh Dass <mayukh.dass at gmail.com> wrote:
>> 
>> Hello,
>> 
>> I am trying to read a set of json files containing tweets using the
>> following code:
>> 
>> json_data <- fromJSON(paste(readLines(json_file))
>> 
>> Unfortunately, it only reads the first record in the file. For example, in
>> the file below, it only reads the first record, starting with
>> "id":"tag:search.twitter.com,2005:3318539389". What is the best way to
>> retrieve these records? I have 20 such JSON files, each with a varying
>> number of tweets.
>> Thank you in advance.
>> 
>> Best,
>> Mayukh
>> 
>> {"id":"tag:search.twitter.com
>> ,2005:3318539389","objectType":"activity","actor":{"objectType":"person","id":"id:
>> twitter.com:2859421","link":"http://www.twitter.com/meetjenn","displayName":"Jenn","postedTime":"2007-01-29T17:06:00.000Z","image":"06-19-07_2010.jpg","summary":"I
>> say 'like' a lot. I fall down a lot. I walk into everything. Love Pgh Pens,
>> NE Pats, Fundraising, Dogs & History. Craft Beer & Running
>> Novice.","links":[{"href":"http://meetjenn.tumblr.com","rel":"me"}],"friendsCount":0,"followersCount":0,"listedCount":0,"statusesCount":0,"twitterTimeZone":"Eastern
>> Time (US &
>> Canada)","verified":false,"utcOffset":"0","preferredUsername":"meetjenn","languages":["en"],"location":{"objectType":"place","displayName":"Pgh/Philajersey"},"favoritesCount":0},"verb":"post","postedTime":"2009-08-15T00:00:12.000Z","generator":{"displayName":"tweetdeck","link":"
>> http://twitter.com
>> "},"provider":{"objectType":"service","displayName":"Twitter","link":"
>> http://www.twitter.com"},"link":"
>> http://twitter.com/meetjenn/statuses/3318539389","body":"Cool story about
>> the man who created the @Starbucks logo. Additional link at the bottom on
>> how it came to be:  http://bit.ly/16bOJk
>> ","object":{"objectType":"note","id":"object:search.twitter.com,2005:3318539389","summary":"Cool
>> story about the man who created the @Starbucks logo. Additional link at the
>> bottom on how it came to be:  http://bit.ly/16bOJk","link":"
>> http://twitter.com/meetjenn/statuses/3318539389
>> ","postedTime":"2009-08-15T00:00:12.000Z"},"twitter_entities":{"urls":[{"expanded_url":null,"indices":[111,131],"url":"
>> http://bit.ly/16bOJk
>> "}],"hashtags":[],"user_mentions":[{"id":null,"name":null,"indices":[41,51],"screen_name":"@Starbucks","id_str":null}]},"retweetCount":0,"gnip":{"matching_rules":[{"value":"Starbucks","tag":null}]}}
>> {"id":"tag:search.twitter.com
>> ,2005:3318543260","objectType":"activity","actor":{"objectType":"person","id":"id:
>> twitter.com:61595468","link":"http://www.twitter.com/FastestFood","displayName":"FastFood
>> Bob","postedTime":"2009-01-30T20:51:10.000Z","image":"","summary":"Just A
>> little food for
>> thought","links":[{"href":"http://www.TeamSantilli.com","rel":"me"}],"friendsCount":0,"followersCount":0,"listedCount":0,"statusesCount":0,"twitterTimeZone":"Pacific
>> Time (US &
>> Canada)","verified":false,"utcOffset":"0","preferredUsername":"FastestFood","languages":["en"],"location":{"objectType":"place","displayName":"eating
>> some
>> thoughts"},"favoritesCount":0},"verb":"post","postedTime":"2009-08-15T00:00:23.000Z","generator":{"displayName":"oauth:17","link":"
>> http://twitter.com
>> "},"provider":{"objectType":"service","displayName":"Twitter","link":"
>> http://www.twitter.com"},"link":"
>> http://twitter.com/FastestFood/statuses/3318543260","body":"Oregon Biz
>> Report » How Starbucks saved millions. Oregon closures ...
>> http://u.mavrev.com/02bdj","object":{"objectType":"note","id":"object:
>> search.twitter.com,2005:3318543260","summary":"Oregon Biz Report » How
>> Starbucks saved millions. Oregon closures ... http://u.mavrev.com/02bdj
>> ","link":"http://twitter.com/FastestFood/statuses/3318543260
>> ","postedTime":"2009-08-15T00:00:23.000Z"},"twitter_entities":{"urls":[{"expanded_url":null,"indices":[70,95],"url":"
>> http://u.mavrev.com/02bdj
>> "}],"hashtags":[],"user_mentions":[]},"retweetCount":0,"gnip":{"matching_rules":[{"value":"Starbucks","tag":null}]}}
>> {"info":{"message":"Replay Request
>> Completed","sent":"2015-02-18T00:05:15+00:00","activity_count":2}}
>> 
> 


