[R] Performance tuning tips when working with wide datasets

Matthew Dowle mdowle at mdowle.plus.com
Wed Nov 24 17:15:37 CET 2010


Richard,

Try data.table. See the introduction vignette and the
presentations; for example, one slide shows a join to
183,000,000 observations of daily stock prices in
0.002 seconds.
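A keyed join looks roughly like this (a minimal sketch with
made-up column names and toy sizes, not your actual data):

library(data.table)

# toy stand-ins for data1 (daily) and data2 (annual); names are illustrative
data1 <- data.table(date = as.Date("1950-01-01") + 0:19999, x = rnorm(20000))
data2 <- data.table(date = as.Date("1950-01-01") + 365 * (0:59), y = rnorm(60))

setkey(data1, date)   # sorts and marks the key; joins then use binary search
setkey(data2, date)

merged <- data2[data1]   # for each row of data1, look up the matching row of data2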

data.table has fast rolling joins (i.e. fast last observation
carried forward) too. I see you asked about that on
this list on 8 Nov. Also see fast aggregations using 'by'
on a key()-ed in-memory table.
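A rough sketch of both, again with illustrative names (roll = TRUE
does the last-observation-carried-forward lookup):

library(data.table)

prices <- data.table(date = as.Date(c("2010-01-04", "2010-01-08")), px = c(100, 101))
setkey(prices, date)

# no price on 2010-01-06, so roll = TRUE carries the 2010-01-04 value forward
prices[J(as.Date("2010-01-06")), roll = TRUE]

# fast grouped aggregation on a keyed in-memory table
dt <- data.table(grp = rep(c("a", "b"), each = 5), val = 1:10, key = "grp")
dt[, sum(val), by = grp]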

I wonder if your 20,000 columns are always
populated for all rows. If not, consider collapsing
to a 3-column table (row, col, data) and then
joining to that. You may have that format in your
original data source anyway, so you may be able
to skip an existing step that expands it to wide.
In other words, keeping it narrow may be an option
(much like how a sparse matrix is stored); see the
sketch below.
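A rough sketch of the narrow approach, with made-up names (melt()
here stands in for however your source data arrives in long form):

library(data.table)

# toy wide annual table: 3 years x 4 measures, standing in for 60 x 20,000
wide <- data.table(date = as.Date(c("2008-01-01", "2009-01-01", "2010-01-01")),
                   m1 = 1:3, m2 = 4:6, m3 = 7:9, m4 = 10:12)

# collapse to 3 columns: (date, col, value), like a sparse/triplet layout
long <- melt(wide, id.vars = "date", variable.name = "col", value.name = "value")
long[, col := as.character(col)]
setkey(long, col, date)

# look up one measure on a daily date, rolling the last annual value forward
long[J("m2", as.Date("2009-06-15")), roll = TRUE]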

Matthew

http://datatable.r-forge.r-project.org/



"Richard Vlasimsky" <richard.vlasimsky at imidex.com> wrote in message 
news:2E042129-4430-4C66-9308-A36B761EBBEB at imidex.com...
>
> Does anyone have any performance tuning tips when working with datasets 
> that are extremely wide (e.g. 20,000 columns)?
>
> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2, by.x = "date", by.y = "date",
>                      all = TRUE, sort = TRUE)
>
> This statement takes about 8 hours to execute on a pretty fast machine. 
> The dataset data1 contains daily data going back to 1950 (20,000 rows) and 
> has 25 columns.  The dataset data2 contains annual data (only 60 
> observations), however there are lots of columns (20,000 of them).
>
> I have to do a lot of these kinds of merges so need to figure out a way to 
> speed it up.
>
> I have tried a number of different things to speed things up, to no avail. 
> I've noticed that rbinds execute much faster using matrices than 
> data frames.  However, the performance improvement when using matrices (vs. 
> data frames) on merges was negligible (8 hours down to 7).  I tried 
> casting my merge field (date) into various different data types 
> (character, factor, date).  This didn't seem to have any effect.  I tried 
> the hash package; however, merge couldn't coerce the class into a 
> data.frame.  I've tried various ways to parallelize computation in the 
> past, and found that to be problematic for a variety of reasons (runaway 
> forked processes, doesn't run in a GUI environment, doesn't run on Macs, 
> etc.).
>
> I'm starting to run out of ideas, anyone?  Merging a 60 row dataset 
> shouldn't take that long.
>
> Thanks,
> Richard
