[R] on ``unfolding'' a json into data frame columns

Hadley Wickham h.wickham at gmail.com
Tue Nov 29 15:26:44 CET 2016


Two quick hints:

* use simplifyDataFrame = FALSE in fromJSON()

* read https://jennybc.github.io/purrr-tutorial/ls02_map-extraction-advanced.html
(and https://jennybc.github.io/purrr-tutorial/)

Hadley

On Tue, Nov 29, 2016 at 8:06 AM, Daniel Bastos <dbastos at toledo.com> wrote:
> Greetings!
>
> In an SQL table, I have a column that contains a JSON.  I'd like easy
> access to all (in an ideal world) of these JSON fields.  I started out
> trying to get all fields from the JSON and so I wrote this function.
>
> unfold.json <- function (df, column)
> {
>     library(jsonlite)
>     ret <- data.frame()
>
>     for (i in 1:nrow(df)) {
>         js <- fromJSON(df[i, ][[column]])
>         ret <- rbind(ret, cbind(df[i, ], js))
>     }
>
>     ret
> }
>
> It takes a data frame and a column-string where the JSON is to be
> found.  It produces a new RET data frame with all the rows of DF but
> with new columns --- extracted from every field in the JSON.
>
> (The performance is horrible.)
>
> fromJSON sometimes produces a list that sometimes contains a data frame.
> As a result, I end up getting a RET data frame with duplicated rows.
> Here's what happens.
>
>> nrow(df)
> [1] 1
>
>> nrow(unfold.json(df, "response"))
> [1] 3
> Warning messages:
> 1: In data.frame(CreateUTC = "2016-11-29 02:00:43", Payload = list( :
>   row names were found from a short variable and have been discarded
> 2: In data.frame(..., check.names = FALSE) :
>   row names were found from a short variable and have been discarded
>>
>
> I expected a data frame with 1 row.  The reason 3 rows is produced is
> because in the JSON there's an array with 3 rows.
>
>> fromJSON(df$response)$RawPayload
> [1] 200   1 128
>
> I have also cases where fromJSON(df$response)$Payload$Fields is a data
> frame containing various rows.  So unfold.json produces a data frame
> with these various rows.
>
> So I gave up on this general approach.
>
> (*) My humble approach
>
> For the moment I'm not interested in RawPayload nor Payload$Fields, so I
> nullified them in this new approach.  To improve performance, I guessed
> perhaps merge() would help and I think it did, but this was not at all a
> decision thought out.
>
> unfold.json.fast <- function (df, column)
> {
>     library(jsonlite)
>     ret <- data.frame()
>     if (nrow(df) > 0) {
>         for (i in 1:nrow(df)) {
>             ls <- fromJSON(df[i, ][[column]])
>             ls$RawPayload <- NULL
>             ls$Payload$Fields <- NULL
>             js <- data.frame(ls)
>             ret <- rbind(ret, merge(df[i, ], js))
>         }
>     }
>
>     ret
> }
>
> I'm looking for advice.  How would you approach this problem?
>
> Thank you!
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
http://hadley.nz



More information about the R-help mailing list