[R] Parsing JSON records to a dataframe

Martin Morgan mtmorgan at fhcrc.org
Fri Jan 7 14:17:55 CET 2011


On 01/07/2011 12:05 AM, Dieter Menne wrote:
> 
> 
> Jeroen Ooms wrote:
>>
>> What is the most efficient method of parsing a dataframe-like structure
>> that has been JSON-encoded in record-based format rather than vector-based
>> format? For example, a structure like this:
>>
>> [ {"name":"joe", "gender":"male", "age":41}, {"name":"anna",
>> "gender":"female", "age":23} ]
>>
>> RJSONIO parses this as a list of lists; I would then have to apply
>> as.data.frame to each element and append the results to an existing
>> dataframe, which is terribly slow.
>>
>>
> 
> unlist is pretty fast. The solution below assumes that you know the
> structure of your data, so it is not very flexible, but it should show you
> that the conversion to data.frame is not the bottleneck.
> 
> # json
> library(RJSONIO)
> # [ {"name":"joe", "gender":"male", "age":41},
> #  {"name":"anna", "gender":"female", "age":23} ]
> n = 300000
> d = data.frame(name=rep(c("joe","anna"),n),
>            gender=rep(c("male","female"),n),
>            age = rep(c("23","41"),n))
> dj = toJSON(d)

This doesn't create the required record-based structure; toJSON() on a
data.frame writes one array per column (shown here for n = 2):

> cat(dj)
{
 "name": [ "joe", "anna", "joe", "anna" ],
   "gender": [ "male", "female", "male", "female" ],
   "age": [ "23", "41", "23", "41" ]
}

Instead, build the record-based JSON directly and parse it into a list of
lists:

library(rjson)
n <- 1000
name <- apply(matrix(sample(letters, n * 5, TRUE), n),
              1, paste, collapse="")
gender <- sample(c("male", "female"), n, TRUE)
age <- ceiling(runif(n, 20, 60))
recs <- sprintf('{"name": "%s", "gender":"%s", "age":%d}',
                name, gender, age)
j <- sprintf("[%s]", paste(recs, collapse=","))
lol <- fromJSON(j)

and then, with a helper that returns a function extracting one named field
from every record,

f <- function(lst)
    function(nm) unlist(lapply(lst, "[[", nm), use.names=FALSE)
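
To make the closure concrete, here is a minimal sketch on two hand-built
records standing in for the fromJSON() output (the data come from the
original question; 'lol_small' and 'get_field' are names introduced only
for this illustration):

```r
# Two records, shaped like the list of lists a JSON parser returns.
lol_small <- list(
    list(name = "joe",  gender = "male",   age = 41),
    list(name = "anna", gender = "female", age = 23)
)

# Same helper as above: f(lst) returns a function that pulls one named
# field out of every record and flattens the result to an atomic vector.
f <- function(lst)
    function(nm) unlist(lapply(lst, "[[", nm), use.names = FALSE)

get_field <- f(lol_small)
ages   <- get_field("age")    # numeric vector, one element per record
names_ <- get_field("name")   # character vector, one element per record
```

Each call walks the records once per field, so the work is proportional to
the number of fields rather than to the number of data.frame objects that
would otherwise have to be created.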

> oopt <- options(stringsAsFactors=FALSE) # convenience for 'identical'
> system.time({
+     df0 <- as.data.frame(Map(f(lol), names(lol[[1]])))
+ })
   user  system elapsed
  0.006   0.000   0.006

versus, for instance, creating one data.frame per record and rbind-ing them:

> system.time({
+     df1 <- do.call(rbind, lapply(lol, data.frame))
+ })
   user  system elapsed
  1.497   0.000   1.500
> identical(df0, df1)
[1] TRUE
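
One caveat: the column-wise approach assumes every record carries every
field. If a field is missing, "[[" returns NULL, unlist() drops it, and the
columns end up with different lengths. A possible variant, offered only as
a sketch and assuming missing fields should become NA ('f_na' and the
sample records are made up for illustration):

```r
# Hypothetical records; the second one has no 'gender' field.
recs <- list(
    list(name = "joe",  gender = "male", age = 41),
    list(name = "anna", age = 23)
)

# Like f(), but replaces absent fields (NULL) with NA before flattening,
# so every column keeps one entry per record.
f_na <- function(lst)
    function(nm) {
        vals <- lapply(lst, "[[", nm)
        vals[vapply(vals, is.null, logical(1))] <- NA
        unlist(vals, use.names = FALSE)
    }

# Collect field names from all records, not just the first one.
fields <- unique(unlist(lapply(recs, names)))
df <- as.data.frame(Map(f_na(recs), fields), stringsAsFactors = FALSE)
```

This costs an extra pass per field to find the NULLs, so it should be
somewhat slower than f() above; I have not timed it.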

Martin


> 
> system.time(d1 <- fromJSON(dj))
> #  user  system elapsed
> #   4.06    0.26    4.32
> 
> system.time(
>   dd <- data.frame(
>     name = unlist(d1$name),
>     gender = unlist(d1$gender),
>     age=as.numeric(unlist(d1$age)))
> )
> #   user  system elapsed
> #   1.13    0.05    1.18
> 
> 
> 
> 


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


