[R] Performance tuning tips when working with wide datasets

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Nov 24 13:43:19 CET 2010


On Wed, 24 Nov 2010, Richard Vlasimsky wrote:

> Does anyone have any performance tuning tips when working with 
> datasets that are extremely wide (e.g. 20,000 columns)?

Don't use data frames.
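
A minimal sketch of the alternative (column names and sizes here are invented for illustration): keep the wide, all-numeric annual table as a matrix with the year as its rownames, so rows can be picked out by year without any merge.

years <- 1950:2009
wide  <- matrix(rnorm(60 * 20000), nrow = 60,
                dimnames = list(years, paste("V", 1:20000, sep = "")))
## look up a few years directly by rowname -- no merge() involved
wide[as.character(1951:1955), 1:5]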

> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2, by.x="date",by.y="date",all=TRUE,sort=TRUE);
>
> This statement takes about 8 hours to execute on a pretty fast 
> machine.  The dataset data1 contains daily data going back to 1950 
> (20,000 rows) and has 25 columns.  The dataset data2 contains annual 
> data (only 60 observations), however there are lots of columns 
> (20,000 of them).
>
> I have to do a lot of these kinds of merges so need to figure out a 
> way to speed it up.
>
> I have tried a number of different things to speed things up to no 
> avail.  I've noticed that rbinds execute much faster using matrices 
> than data frames.  However, the performance improvement when using 
> matrices (vs. data frames) on merges was negligible (8 hours down 
> to 7).  I tried casting my merge field (date) into various different 
> data types (character, factor, date).  This didn't seem to have any 
> effect.  I tried the hash package, but merge couldn't coerce the 
> class into a data.frame.  I've tried various ways to parallelize 
> computation in the past, and found that to be problematic for a 
> variety of reasons (runaway forked processes, doesn't run in a GUI 
> environment, doesn't run on Macs, etc.).
>
> I'm starting to run out of ideas, anyone?  Merging a 60 row dataset 
> shouldn't take that long.

Correct, but that is not what you said you are doing.

Why do you need to do the merge at all?  It sounds like you simply 
need to create a suitable index into the rows of the 20,000-column 
data frame and use that appropriately, e.g. merge data1 with a small 
data2 containing the year and a column index=1:60.

date <- seq(from=as.Date("1950-01-01"), to=Sys.Date(), by=1)  # daily dates from 1950 onwards
year <- 1900 + as.POSIXlt(date)$year                          # calendar year of each date
## dummy daily data: one row per date, 25 numeric columns
data1 <- data.frame(date, year, matrix(rnorm(length(date)*25), ncol=25))
## small lookup table: one row per year, plus a row index into the real data2
fdata2 <- data.frame(year=1950:2010, index=1:61)

> system.time(m <- merge(data1, fdata2, by='year', all=TRUE, sort=TRUE))
    user  system elapsed
   0.226   0.011   0.237

and use m$index to index the real data2.
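
For example (data2 here stands for the real annual table, with one row per year in the order that fdata2$index enumerates; that layout is an assumption):

## one annual row per daily row; any NA index gives a row of NAs
data2_rows <- data2[m$index, ]
result     <- cbind(m[, c("date", "year")], data2_rows)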

We don't have an example, but at the very least you are ending up 
with a data frame with 20,000 rows and ca. 20,000 columns.  That's 
not a small object.
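
As a rough sanity check on the size (assuming all-numeric, double-precision columns):

20000 * 20000 * 8 / 2^30   # ~3 GB for the values alone, before any copies merge() makes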

> Thanks,
> Richard

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


