[R] merging and working with BIG data sets. Is sqldf the best way??

Gabor Grothendieck ggrothendieck at gmail.com
Thu Oct 14 01:31:52 CEST 2010


On Tue, Oct 12, 2010 at 2:39 AM, Chris Howden
<chris at trickysolutions.com.au> wrote:
> I’m working with some very big datasets (each dataset has 11 million rows
> and 2 columns). My first step is to merge all of my individual datasets
> together (I have about 20).
>
> I’m using the following command from sqldf
>
>               data1 <- sqldf("select A.*, B.* from A inner join B
> using(ID)")
>
> But it’s taking A VERY VERY LONG TIME to merge just 2 of the datasets
> (well over 2 hours, possibly longer since it’s still going).

You need to add indexes to your tables. See Example 4i on the sqldf home
page, http://sqldf.googlecode.com — this can result in huge speedups for
large tables.
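As a rough sketch of that pattern (hedged: this assumes data frames A and B
each carry an ID column, as in your query; the persistent-connection trick
and the main. prefix follow the indexed-join example in the sqldf docs):

```r
library(sqldf)

sqldf()                                  # open a persistent connection so
                                         # uploaded tables and indexes survive
                                         # across sqldf() calls

sqldf("create index ai on A(ID)")        # A is uploaded once and indexed
sqldf("create index bi on B(ID)")        # likewise for B

# Qualify with main. so sqldf uses the already-uploaded, indexed tables
# instead of re-uploading the R data frames:
data1 <- sqldf("select A.*, B.* from main.A inner join main.B using(ID)")

sqldf()                                  # close the connection when done
```

Without the indexes, SQLite has to scan one table for every row of the
other; with them, the join becomes an index lookup, which is where the
speedup comes from.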

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
