[R] merging and working with BIG data sets. Is sqldf the best way??

Gabor Grothendieck ggrothendieck at gmail.com
Fri Oct 15 05:32:37 CEST 2010


On Thu, Oct 14, 2010 at 10:56 PM, Chris Howden
<chris at trickysolutions.com.au> wrote:
> Thanks for the suggestion and code Gabor,
>
> I've tried creating 2 indices:
>
> 1) just for the variable I intend to merge on
> 2) on the entire data set I am merging (which I think is the one I should
> be using??)
>
> However neither seemed to work. The first was still going after 2 hours,
> and the second after 12 hours, so I stopped the join.
>
> If it's not too much bother I was wondering if U could let me know which
> index I should be using?
>
>
> Or in other words since I plan to merge using POINTID do I create an index
> on
>
> system.time(sqldf("create index ai1 on A(POINTID)"))
> system.time(sqldf("create index ai2 on B(POINTID)"))
>
> or
>
> system.time(sqldf("create index ai1 on A(POINTID,alistair)"))
> system.time(sqldf("create index ai2 on B(POINTID, alistair_range)")
>
>
>
> I'm now using the following join statement
> system.time(data2 <- sqldf("select * from A natural join B"))
>

If you only ran the three sqldf statements you mentioned in your post
(thereby omitting 2 of the 5 sqldf calls in example 4i):

sqldf("create...")
sqldf("create...")
sqldf("select...")

then what you are doing is to create a database, upload your data to
it, create an index on it, destroy the database, then create a second
database, upload the data to this second database, create an second
index and then destroy that database too and then finally create a
third database, upload the data to it and then do a join without any
indexes.

You must bracket all this with empty sqldf calls as shown in 4i to
force persistence:

sqldf()
sqldf("create...")
sqldf("create...")
sqldf("select...")
sqldf()

or else put all of the sql statements in a vector to one sqldf call:

sqldf(c("create...", "create...", "select..."))

Also you can replace "select ..." with "explain query plan select
...", and it will let you know which indexes its actually using.  e.g.
in example 4i if we do that:

> sqldf("explain query plan select * from main.DF1 natural join main.DF2")
  order from                   detail
1     0    0                TABLE DF1
2     1    1 TABLE DF2 WITH INDEX ai2

we see that it really only used index ai2 and not index ai1 at all so
we could have saved the 19 seconds that it took to create ai1 as it
was never used.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



More information about the R-help mailing list