[R] merging or joining 2 dataframes: merge, rbind.fill, etc.?

Dennis Murphy djmuser at gmail.com
Thu Feb 28 06:52:56 CET 2013


Hi:

The other day I ran 100K simulations, each of which returned a 20 x 4
data frame. I stored these in a list object. When attempting to rbind
them into a single large data frame, my first thought was to try plyr:

library(plyr)
bigD <- ldply(L, rbind)   # where L is the list object

I gave up after about half an hour. Ditto for do.call(rbind, L). [Sorry, I
didn't time either one; these are approximate times.] I then checked
whether the data.table package could do this and discovered the
rbindlist() function. When applied to my list object, it ran correctly
in under a second. Here's the actual example, with some names changed
to mask the application:

> g <- gs[1:100000]   # gs is a list of lists
> length(g)
[1] 100000
> class(g)
[1] "list"
> dim(g[[1]])
[1] 20  4
> dim(g[[100000]])
[1] 20  4
> library(data.table)
> system.time(bigD <- rbindlist(g))
   user  system elapsed
   0.45    0.02    0.47
> dim(bigD)
[1] 2000000       4
> class(bigD)
[1] "data.table" "data.frame"
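
A minimal sketch for reproducing the comparison on simulated input
follows. The list sim_list, its column names, and the size of 1000
elements are made up for illustration and are not the data from the run
above; the data.table part runs only if that package is installed.

```r
# Hypothetical stand-in for the simulation output: 1000 data frames, each 20 x 4.
set.seed(1)
sim_list <- lapply(seq_len(1000), function(i) {
  data.frame(sim = i, x = rnorm(20), y = rnorm(20), z = rnorm(20))
})

# Base R: a single rbind() call over all elements. Correct, but it tends to
# slow down sharply as the number of data frames grows.
t_base <- system.time(big_base <- do.call(rbind, sim_list))
print(t_base)

# data.table: rbindlist() binds the whole list at once (skipped if not installed).
if (requireNamespace("data.table", quietly = TRUE)) {
  t_dt <- system.time(big_dt <- data.table::rbindlist(sim_list))
  print(t_dt)
  stopifnot(identical(dim(big_dt), dim(big_base)))
}

dim(big_base)
```

Scaling the list up from 1000 elements should make the gap between the
two timings much easier to see.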

Dennis

On Tue, Feb 26, 2013 at 7:05 PM, David Kulp <dkulp at fiksu.com> wrote:
> On Feb 26, 2013, at 9:33 PM, Anika Masters <anika.masters at gmail.com> wrote:
>
>> Thanks Arun and David.  Another issue I am running into is memory
>> usage when one of the data frames I'm trying to rbind to or merge
>> with is "very large".  (This is a recurring problem, as I am trying
>> to merge/rbind thousands of small dataframes into a single "very
>> large" dataframe.)
>>
>> I'm thinking of creating a function that creates an empty dataframe to
>> which I can add data, but will need to first determine and ensure that
>> each dataframe has the exact same columns, in the exact same
>> "location".
>>
>> Before I write any new code, are there any pre-existing functions or
>> code that might solve this problem of merging small or medium-sized
>> dataframes with a "very large" dataframe?
>
> Consider plyr. Memory issues can be a problem, but it's a piece of
> cake to write a one liner that iterates over a list of data frames and
> returns them all rbind'd together.  Or just: do.call(rbind,
> list.of.data.frames).
>
> If memory is a serious problem then I think it's best to write your
> own code that appends each row by index - which avoids copying entire
> data frames in memory.
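
One way to sketch the preallocation idea, under some assumptions: every
data frame has the same columns in the same order, the list is not
empty, and the columns are plain atomic vectors (no factors). The helper
name bind_prealloc is hypothetical. Note that this fills preallocated
column vectors by index rather than appending to a data frame row by
row, which should avoid recopying the accumulated result at each step.

```r
# Hypothetical helper: bind same-shaped data frames by filling preallocated columns.
bind_prealloc <- function(dfs) {
  n_rows <- vapply(dfs, nrow, integer(1))
  total  <- sum(n_rows)
  ends   <- cumsum(n_rows)
  starts <- ends - n_rows + 1L
  cols   <- names(dfs[[1]])

  # Allocate each output column once, using the first data frame as a type template.
  out <- lapply(dfs[[1]][cols], function(col) vector(typeof(col), total))

  for (i in seq_along(dfs)) {
    stopifnot(identical(names(dfs[[i]]), cols))  # same columns, same order
    if (n_rows[i] == 0L) next
    idx <- starts[i]:ends[i]
    for (nm in cols) out[[nm]][idx] <- dfs[[i]][[nm]]
  }
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Usage on two small made-up data frames.
dfs <- list(
  data.frame(a = 1:2, b = c("u", "v"), stringsAsFactors = FALSE),
  data.frame(a = 3:5, b = c("w", "x", "y"), stringsAsFactors = FALSE)
)
res <- bind_prealloc(dfs)
res
```
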
>
>>
>> On Tue, Feb 26, 2013 at 2:00 PM, David L Carlson <dcarlson at tamu.edu> wrote:
>>> Clumsy but it doesn't require any packages:
>>>
>>> merge2 <- function(x, y) {
>>>   # rbind when both have the same set of column names; otherwise outer join
>>>   if (setequal(names(x), names(y))) {
>>>     rbind(x, y)
>>>   } else merge(x, y, all = TRUE)
>>> }
>>> merge2(df1, df2)
>>> df3 <- df1
>>> merge2(df1, df3)
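
A base-R alternative that never falls back on merge() is to pad each
data frame with the columns it lacks and then rbind. This is a sketch
only: rbind_fill_base is a made-up name, and it assumes both inputs have
at least one row and compatible column types. df1 and df2 below hold the
same values as in the original question.

```r
# Hypothetical helper: stack two data frames, padding missing columns with NA.
rbind_fill_base <- function(x, y) {
  all_cols <- union(names(x), names(y))
  pad <- function(df) {
    for (nm in setdiff(all_cols, names(df))) df[[nm]] <- NA
    df[all_cols]  # put columns in a common order
  }
  rbind(pad(x), pad(y))
}

df1 <- data.frame(a = 7, b = 99, d = 12)
df2 <- data.frame(d = 88, b = 34, x = 12, y = 44, c = 56)

rbind_fill_base(df1, df2)   # 2 rows, columns a b d x y c
rbind_fill_base(df1, df1)   # 2 rows; merge(df1, df1, all = TRUE) returns 1
```

Because merge() joins on the shared columns, rows that agree on all of
them collapse into one, which is why the identical-data-frame case
returns a single row. Padding and stacking keeps one output row per
input row.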
>>>
>>> ----------------------------------------------
>>> David L Carlson
>>> Associate Professor of Anthropology
>>> Texas A&M University
>>> College Station, TX 77843-4352
>>>
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>>>> project.org] On Behalf Of arun
>>>> Sent: Tuesday, February 26, 2013 1:14 PM
>>>> To: Anika Masters
>>>> Cc: R help
>>>> Subject: Re: [R] merging or joining 2 dataframes: merge, rbind.fill,
>>>> etc.?
>>>>
>>>> Hi,
>>>>
>>>> You could also try:
>>>> library(gtools)
>>>> smartbind(df2,df1)
>>>> #  a  b  d
>>>> #1 7 99 12
>>>> #2 7 99 12
>>>>
>>>>
>>>> When df1!=df2
>>>> smartbind(df1,df2)
>>>> #   a  b  d  x  y  c
>>>> #1  7 99 12 NA NA NA
>>>> #2 NA 34 88 12 44 56
>>>> A.K.
>>>>
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: Anika Masters <anika.masters at gmail.com>
>>>> To: r-help at r-project.org
>>>> Cc:
>>>> Sent: Tuesday, February 26, 2013 1:55 PM
>>>> Subject: [R] merging or joining 2 dataframes: merge, rbind.fill, etc.?
>>>>
>>>> #I want to "merge" or "join" 2 dataframes (df1 & df2) into a 3rd
>>>> (mydf).  I want the 3rd dataframe to contain 1 row for each row in df1
>>>> & df2, and all the columns in both df1 & df2. The solution should
>>>> "work" even if the 2 dataframes are identical, and even if the 2
>>>> dataframes do not have the same column names.  The rbind.fill function
>>>> seems to work.  For learning purposes, are there other "good" ways to
>>>> solve this problem, using merge or other functions other than
>>>> rbind.fill?
>>>>
>>>> #e.g. These 3 examples all seem to "work" correctly and as I hoped:
>>>>
>>>> df1 <- data.frame(matrix(data=c(7, 99, 12) ,  nrow=1 ,  dimnames =
>>>> list( NULL ,  c('a' , 'b' , 'd') ) ) )
>>>> df2 <- data.frame(matrix(data=c(88, 34, 12, 44, 56) ,  nrow=1 ,
>>>> dimnames = list( NULL ,  c('d' , 'b' , 'x' ,  'y', 'c') ) ) )
>>>> mydf <- merge(df2, df1, all.y=T, all.x=T)
>>>> mydf
>>>>
>>>> #e.g. this works:
>>>> library(reshape)
>>>> mydf <- rbind.fill(df1, df2)
>>>> mydf
>>>>
>>>> #This works:
>>>> library(reshape)
>>>> mydf <- rbind.fill(df1, df2)
>>>> mydf
>>>>
>>>> #But this does not (the 2 dataframes are identical)
>>>> df1 <- data.frame(matrix(data=c(7, 99, 12) ,  nrow=1 ,  dimnames =
>>>> list( NULL ,  c('a' , 'b' , 'd') ) ) )
>>>> df2 <- df1
>>>> mydf <- merge(df2, df1, all.y=T, all.x=T)
>>>> mydf
>>>>
>>>> #Any way to get "merge" to work for this final example? Any other good
>>>> solutions?
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-
>>>> guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.


