[R] Efficient way to do a merge in R

Rainer Schuermann rainer.schuermann at gmx.net
Tue Oct 4 09:40:22 CEST 2011


> Any comments are very welcome,
So I give it a shot, although I don't have answers but only some ideas which avenues I would explore, not being an 
expert at all:

1. I would try to be more restrictive with the columns used for merge, trying something like
m1 <- merge( x, y, by.x = "V1", by.y = "V1", all = TRUE )

2. It may be an option to use match() directly:
indices <- match( y$V1, x$V1 )
That should give you a vector of 300,000 indices mapping the y values to their corresponding x records. I assume that 
there is always one record in y matching one record in x. You would still need to write some code to add the 
corresponding y values to a new column in x.

3. If that fails, and nobody else has a better idea, I would consider using a database engine for the job.

Again, no expert advice, just a few ideas!

Rgds,
Rainer


On Tuesday 04 October 2011 01:01:45 Aurélien PHILIPPOT wrote:
> Dear all,
> I am new in R and I have been faced with the following problem, that slows
> me down a lot.  I am short of ideas to circumvent it. So, any help would be
> highly appreciated:
> 
> I have 2 dataframes x and y.  x is very big (70 million observations),
> whereas y is smaller (300000 observations).
> All the observations of y are present in x. But y has one additional
> variable that I would like to incorporate to the dataframe x.
> 
> For instance, imagine they have the following variable names:
> colnames(x)<- c("V1", "V2", "V3", "V4") and colnames(y)<- c("V1", "V2",
> "V5")
> 
> -Since the observations of y are present in x, my strategy was to merge x
> and y so that the dataframe x would get the values of the variable V5 for
> the observations that are both in x and y.
> 
> -So, I did the following:
> dat<- merge(x, y, all=TRUE).
> 
> On a small example, it works fine. The only problem is that when I apply it
> to my big dataframe x, it really take for ever (several days and not done
> yet) and I have a very  fast computer. So, I don't know whether I should
> stop now or keep on waiting.
> 
> Does anyone have any idea to perform this operation in a more efficient way
> (in terms of computation time)?
> In addition, does anyone know how to incoporate some sort of counter in a
> program to check what how much work has been done at a given point of time?
> 
> Any comments are very welcome,
> Thanks,
> 
> Best,
> Aurelien
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list