[R] How to quickly convert a data.frame into a structure of lists

William Dunlap wdunlap at tibco.com
Wed Aug 10 19:04:32 CEST 2011


I was going to suggest
  > AB <- df[c("A","B")]
  > ls2 <- array(split(df$C, AB), dim=sapply(AB, nlevels), dimnames=lapply(AB, levels))
which produces a matrix very similar to what Duncan's by() call produces
  > ls1 <- by(df$C, df[,1:2], identity)
E.g.,
  > ls2[["a","X"]]
  [1] 1 2
  > ls1[["a","X"]]
  [1] 1 2
  > ls1[["a","Y"]] # by assigns NULL to unoccupied slots
  NULL
  > ls2[["a","Y"]] # split gives the same type to all slots, copied from input
  numeric(0)

Both are quick because they call split() once, which avoids the repeated
evaluations of
  bigVector[ anotherBigVector == scalar ]
that your nested (not imbricated) loops do.  If you really need a list of
lists, converting the matrix to one will probably be a quick transformation.
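
One possible way to do that last conversion is sketched below. It is only an illustration: the name `nested` is made up for this example, and stringsAsFactors=TRUE is an assumption added so that A and B are factors on R 4.0 and later, where data.frame() no longer converts strings to factors by default.

```r
# Example data from the thread. stringsAsFactors=TRUE is an assumption added
# here so that levels() and nlevels() see factors on R >= 4.0.
df <- data.frame(A=c("a","a","b","b"), B=c("X","X","Y","Z"), C=c(1,2,3,4),
                 stringsAsFactors=TRUE)

# A single split() call groups C by every A/B combination (A varies fastest),
# so the result can be reshaped into a matrix of vectors.
AB <- df[c("A","B")]
ls2 <- array(split(df$C, AB), dim=sapply(AB, nlevels), dimnames=lapply(AB, levels))

# Walk the row and column names once; with simplify=FALSE, sapply() returns
# lists and uses the character input as the list names.
nested <- sapply(rownames(ls2), function(a) {
  sapply(colnames(ls2), function(b) ls2[[a, b]], simplify=FALSE)
}, simplify=FALSE)

nested$a$X   # [1] 1 2
nested$b$Z   # [1] 4
nested$a$Y   # numeric(0)
```

The double loop here runs only over the levels of A and B, not over the rows of df, so its cost depends on the number of cells rather than on the size of the data.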

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Duncan Murdoch
> Sent: Wednesday, August 10, 2011 9:43 AM
> To: Frederic F
> Cc: r-help at r-project.org
> Subject: Re: [R] How to quickly convert a data.frame into a structure of lists
> 
> On 10/08/2011 10:30 AM, Frederic F wrote:
> > Hello Duncan,
> >
> > Here is a small example to illustrate what I am trying to do.
> >
> > # Example data.frame
> > df=data.frame(A=c("a","a","b","b"), B=c("X","X","Y","Z"), C=c(1,2,3,4))
> > #   A B C
> > # 1 a X 1
> > # 2 a X 2
> > # 3 b Y 3
> > # 4 b Z 4
> >
> > ### First way of getting the list structure (ls1) using imbricated lapply loops:
> > # Get the structure and populate it:
> > ls1<-lapply(levels(df$A), function(levelA) {
> >        lapply(levels(df$B), function(levelB) {df$C[df$A==levelA & df$B==levelB]})
> > })
> > # Apply the names:
> > names(ls1)<-levels(df$A)
> > for (i in seq_along(ls1))
> > {names(ls1[[i]])<-levels(df$B)}
> >
> > # Result:
> > ls1$a$X
> > # [1] 1 2
> > ls1$b$Z
> > # [1] 4
> >
> > The data.frame will always be 'complete', i.e., there will be a value in
> > every row for every column.
> > I want to produce a structure like this one quickly (I am aiming for something
> > below 10 seconds) for a dataset containing between 1 and 2 million rows.
> >
> 
> I don't know what the timing would be like for your real data, but this
> does look like by() would work:
> 
> ls1 <- by(df$C, df[,1:2], identity)
> 
> When I repeat the rows of df a million times each, this finishes in a
> few seconds.  It would definitely be slower if there were more levels of
> A or B.
> 
> Now ls1 will be a matrix whose entries are the subsets of C that you
> want, so you can see your two results with slightly different syntax:
> 
>  > ls1[["a", "X"]]
> [1] 1 2
>  > ls1[["b","Z"]]
> [1] 4
> 
> Duncan Murdoch
> 