[R] How do I combine lists of data.frames into a single data frame?

Thu Jul 15 22:52:05 CEST 2010

Ted,

I may not be completely clear on how you have your processes implemented, but some thoughts:

If you will be creating multiple lists initially, where each list (say z1...z4) contains 1 or more data frames and all of the data frames have the same column structure, you can use:

  do.call(rbind, c(z1, z2, z3, z4))

For example, using the iris data set:

  list1 <- list(head(iris), head(iris), head(iris))

  list2 <- list(head(iris), head(iris))

So these now have 3 and 2 copies, respectively, of 6 rows from the iris data set. You can then do:

  DF <- do.call(rbind, c(list1, list2))

> str(DF)
'data.frame':	30 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 5.1 4.9 4.7 4.6 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.5 3 3.2 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.4 1.3 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.2 0.2 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

So DF now contains 30 rows (6 rows * 5 data frames).
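
The c() call is what makes this work: on lists it concatenates the elements into one flat list, whereas list(list1, list2) would nest them. A quick check, as a minimal sketch that redefines list1 and list2 from above so it runs on its own ('flat' and 'nested' are just illustrative names):

```r
list1 <- list(head(iris), head(iris), head(iris))
list2 <- list(head(iris), head(iris))

flat   <- c(list1, list2)      # one flat list of 5 data frames
nested <- list(list1, list2)   # a list of 2 lists, not what rbind needs

length(flat)                       # 5
length(nested)                     # 2
all(sapply(flat, is.data.frame))   # TRUE

# rbind receives the 5 data frames as 5 separate arguments
DF <- do.call(rbind, flat)
dim(DF)                            # 30 5
```

The same c() call can be used to append each new list to a running 'master' list before a single do.call(rbind, ...) at the end.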

I am not sure if that will spark some thoughts, but ideally, if you can arrange for all of your operations to fill a single list (e.g., within a loop construct), you avoid repeatedly copying objects, which adds both time and RAM overhead. You can then just use the do.call(rbind, YourList) construct on that single 'all inclusive' list.

If you know ahead of time how many data frames will be created in total, you can preallocate a 'master' list object and index into it in the loop. Use vector("list", N), where N is the total number of list elements that you will require. For example:

> vector("list", 5)
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

[[5]]
NULL

will preallocate a list of 5 elements, each of which can then be indexed to contain a data frame that is a result of your looping operation.
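
Putting the pieces together, here is a minimal sketch of the preallocate, fill, and combine pattern. Note that make_result() is a hypothetical stand-in for whatever your loop actually computes, and N = 5 is an arbitrary choice; the only assumption that matters is that every iteration returns a data frame with the same columns:

```r
# Hypothetical stand-in for the per-iteration analysis:
# returns a one-row data frame with a fixed column structure.
make_result <- function(i) {
  data.frame(id = i, value = i^2)
}

N <- 5
results <- vector("list", N)       # preallocate N empty (NULL) slots

for (i in seq_len(N)) {
  results[[i]] <- make_result(i)   # store this iteration's data frame in slot i
}

# Combine once, after the loop, rather than calling rbind on every iteration
DF <- do.call(rbind, results)
str(DF)
```

Calling rbind once at the end, rather than growing a data frame inside the loop, is what avoids the repeated copying mentioned above.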

HTH,

Marc

On Jul 15, 2010, at 2:58 PM, Ted Byers wrote:

> Thanks Marc
> 
> The next part of the question, though, involves the fact that there is a new
> 'z' list made in almost every iteration through the ID loop.
> 
> I guess there are two parts to the question.  First, how would I make a list
> containing all the data frames created by a call to rbind?  I assume, then,
> that I could call rbind again to make that new list into a single
> data.frame.  Second, is it possible to just append one list of objects to
> another list of objects, and would doing that and calling rbind on that
> master list be more efficient than calling rbind on each z list and then
> calling rbind after the loop on the list of such data.frames?
> 
> Thanks again,
> 
> Ted
> 
> On Thu, Jul 15, 2010 at 3:27 PM, Marc Schwartz <marc_schwartz at me.com> wrote:
> 
>> On Jul 15, 2010, at 2:18 PM, Ted Byers wrote:
>> 
>>> The data.frame is constructed by one of the following functions:
>>> 
>>> funweek <- function(df)
>>> if (length(df$elapsed_time) > 5) {
>>>   rv = fitdist(df$elapsed_time,"exp")
>>>   rv$year = df$sale_year[1]
>>>   rv$sample = df$sale_week[1]
>>>   rv$granularity = "week"
>>>   rv
>>> }
>>> funmonth <- function(df)
>>> if (length(df$elapsed_time) > 5) {
>>>   rv = fitdist(df$elapsed_time,"exp")
>>>   rv$year = df$sale_year[1]
>>>   rv$sample = df$sale_month[1]
>>>   rv$granularity = "month"
>>>   rv
>>> }
>>> 
>>> It is basically the data.frame created by fitdist extended to include the
>>> variables used to distinguish one sample from another.
>>> 
>>> I have the following statement that gets me a set of IDs from my db:
>>> 
>>> ids <- dbGetQuery(con, "SELECT DISTINCT m_id FROM risk_input")
>>> 
>>> And then I have a loop that allows me to analyze one dataset after another:
>>> 
>>> for (i in 1:length(ids[,1])) {
>>> print(i)
>>> print(ids[i,1])
>>> 
>>> Then, after a set of statements that give me information about the dataset
>>> (such as its size), within a conditional block that ensures I apply the
>>> analysis only on sufficiently large samples, I have the following:
>>> 
>>> z <- lapply(split(moreinfo, list(moreinfo$sale_year, moreinfo$sale_week),
>>>                   drop = TRUE), funweek)
>>> 
>>> or
>>> 
>>> z <- lapply(split(moreinfo, list(moreinfo$sale_year, moreinfo$sale_month),
>>>                   drop = TRUE), funmonth)
>>> 
>>> followed by:
>>> 
>>> str(z)
>>> 
>>> Of course, I close the loop and disconnect from my db.
>>> 
>>> NB: I don't see any way to get rid of the loop by adding ID as a factor to
>>> split because I have to query the DB for several key bits of data in order
>>> to determine whether or not there is sufficient data to work on.
>>> 
>>> I have everything working, except the final step of storing the results back
>>> into the db.  Storing data in the Db is easy enough.  But I am at a loss as
>>> to how to combine the lists placed in z in most of the iterations through
>>> the ID loop into a single data.frame.
>>> 
>>> Now, I did take a look at rbind and cbind, but it isn't clear to me if
>>> either is appropriate.  All the data frames have the same structure, but the
>>> lists are of variable length, and I am not certain how either might be used
>>> inside the IDs loop.
>>> 
>>> So, what is the best way to combine all lists assigned to z into a single
>>> data.frame?
>>> 
>>> Thanks
>>> 
>>> Ted
>> 
>> 
>> Ted,
>> 
>> If each of the data frames in the list 'z' has the same column structure,
>> you can use:
>> 
>> do.call(rbind, z)
>> 
>> The result of which will be a single data frame containing all of the rows
>> from each of the data frames in the list.
>> 
>> HTH,
>> 
>> Marc Schwartz
>> 
>> 
> 
> 
> -- 
> R.E.(Ted) Byers, Ph.D.,Ed.D.
> TED at MERCHANTSERVICECORP.COM
> CTO
> Merchant Services Corp.
> 350 Harry Walker Parkway North, Suite 8
> Newmarket, Ontario
> L3Y 8L3