[R] Efficient way to do a merge in R

Joshua Wiley jwiley.psych at gmail.com
Tue Oct 4 10:04:40 CEST 2011


On Tue, Oct 4, 2011 at 12:40 AM, Rainer Schuermann
<rainer.schuermann at gmx.net> wrote:
>> Any comments are very welcome,
> So I'll give it a shot, although I don't have answers, only some ideas about which avenues I would explore, not being
> an expert at all:
>
> 1. I would try to be more restrictive with the columns used for merge, trying something like
> m1 <- merge( x, y, by.x = "V1", by.y = "V1", all = TRUE )
>
> 2. It may be an option to use match() directly:
> indices <- match( y$V1, x$V1 )
> That should give you a vector of 300,000 indices mapping the y values to their corresponding x records. I am assuming
> that each record in y matches exactly one record in x. You would still need to write some code to add the
> corresponding y values to a new column in x.

I think this idea is a good one (though even match() could be slow with
70 million observations).  I believe the extraction and assignment
methods for data frames end up making some extra copies of the data (at
least that is my understanding; experts may correct me), so I would
consider using a list instead.  You lose the built-in data frame check
that all variables have the same length (the same number of rows), but
I think it makes the data faster to work with.  If you know the indices
in x where the y values should go and the class of the new column (say
numeric), then:
tmp <- vector("numeric", 70000000)  # preallocate one element per row of x
tmp[indices] <- y$V5                # indices from match(y$V1, x$V1) above
x$V5 <- tmp                         # attach the filled vector as a new column
rm(tmp)                             # drop the temporary vector
gc()                                # and reclaim its memory
and you're done.  It takes less than a minute to run on my little laptop
(8 GB RAM, 1.6 GHz dual core, only slightly faster than a netbook).
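
For concreteness, here is a minimal, self-contained sketch of that
match()-and-assign approach, using made-up toy data in place of the
real x and y, and filling with NA rather than 0 so that rows of x with
no match in y are easy to spot (a small variation on the snippet
above):

## toy stand-ins for the real x (70 million rows) and y (300,000 rows)
x <- data.frame(V1 = 1:10, V2 = rnorm(10), V3 = rnorm(10), V4 = rnorm(10))
y <- data.frame(V1 = c(2, 5, 9), V2 = x$V2[c(2, 5, 9)], V5 = c(10, 20, 30))

indices <- match(y$V1, x$V1)   # row of x corresponding to each row of y
tmp <- rep(NA_real_, nrow(x))  # preallocate; rows of x not in y stay NA
tmp[indices] <- y$V5           # drop the y values into their slots
x$V5 <- tmp                    # attach as a new column of x

The same pattern applies unchanged to the full-size data, since the
preallocation length comes from nrow(x).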

>
> 3. If that fails, and nobody else has a better idea, I would consider using a database engine for the job.

Not a bad idea for working with large datasets either.
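
If you want to try the database route without leaving R, one low-effort
option (assuming the sqldf package, which has not been mentioned in
this thread, and that V1 alone identifies the matching rows) is to let
SQLite do the join:

library(sqldf)  # copies the data frames into a temporary SQLite database

## left join: keep every row of x, add y's V5 where a matching V1 exists
dat <- sqldf("select x.*, y.V5
              from x left join y using (V1)")

A keyed join with the data.table package would be another option worth
benchmarking on data this size.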

>
> Again, no expert advice, just a few ideas!
>
> Rgds,
> Rainer
>
>
> On Tuesday 04 October 2011 01:01:45 Aurélien PHILIPPOT wrote:
>> Dear all,
>> I am new to R and I have been faced with the following problem, which slows
>> me down a lot.  I am short of ideas to circumvent it, so any help would be
>> highly appreciated:
>>
>> I have 2 data frames, x and y.  x is very big (70 million observations),
>> whereas y is smaller (300,000 observations).
>> All the observations of y are present in x, but y has one additional
>> variable that I would like to incorporate into the data frame x.
>>
>> For instance, imagine they have the following variable names:
>> colnames(x) <- c("V1", "V2", "V3", "V4") and colnames(y) <- c("V1", "V2",
>> "V5")
>>
>> -Since the observations of y are present in x, my strategy was to merge x
>> and y so that the dataframe x would get the values of the variable V5 for
>> the observations that are both in x and y.
>>
>> -So, I did the following:
>> dat <- merge(x, y, all = TRUE)
>>
>> On a small example, it works fine. The only problem is that when I apply it
>> to my big data frame x, it really takes forever (several days and not done
>> yet), even though I have a very fast computer. So, I don't know whether I
>> should stop now or keep on waiting.
>>
>> Does anyone have any idea how to perform this operation in a more efficient
>> way (in terms of computation time)?
>> In addition, does anyone know how to incorporate some sort of counter in a
>> program to check how much work has been done at a given point in time?
>>
>> Any comments are very welcome,
>> Thanks,
>>
>> Best,
>> Aurelien



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/


