[BioC] tapply for enormous (>2^31 row) matrices

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed Feb 22 01:10:02 CET 2012


Hi,

On Tue, Feb 21, 2012 at 6:11 PM, Matthew Keller <mckellercran at gmail.com> wrote:
> Hello all,
>
> I just sent this to the main R forum, but realized this audience might
> have more familiarity with this type of problem...

If you're determined to do this in R, I'd split your file into a few
smaller ones (you can even use the *nix `split` command), do your
group-by-and-summarize on the smaller files in separate R
processes, then summarize your summaries (sounds like a job for
hadoop, no?).
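The pre-split step might look like the sketch below (the file name, chunk size, and column layout are made up for illustration; a tiny stand-in file is generated so the commands are self-contained):

```shell
# Generate a small stand-in for the huge score file: ID1 ID2 XX score
seq 1 10 | awk '{print "id" ($1 % 3), "id" ($1 % 2), "x", $1}' > scores.txt

# Split into fixed-size line chunks named scores_aa, scores_ab, ...
# (for a real >2^31-row file you'd use something like -l 100000000)
split -l 4 scores.txt scores_

ls scores_*   # each chunk can now be fed to its own R process
```

One caveat: rows for the same (ID1, ID2) pair can straddle a chunk boundary, so the per-chunk sums are only partial and must be re-aggregated in the final summarize-the-summaries pass.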

For your `tapply` functionality, I'd look to the data.table package --
it has super-fast group-by mojo, and tries to be as memory efficient
as possible.

Assuming you can get your (subset of) data into a data.frame `df` and
that your column names are something like c("ID1", "ID2", "XX",
"score"), you'd then:

R> library(data.table)
R> df <- as.data.table(df) ## makes a copy
R> setkeyv(df, c("ID1", "ID2")) ## no copy
R> ans <- df[, list(shared=sum(score)), by=key(df)]

Summarizing the results from separate processes will be trivial.
Loading your data into a data.frame to start with, however, will
likely take painfully long.
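That final combine step could even stay outside R. Assuming each R process writes its chunk summary as whitespace-separated "ID1 ID2 shared" lines (a hypothetical layout chosen for illustration), a sort/awk pass re-sums any (ID1, ID2) pair that spanned chunks:

```shell
# Fake per-chunk summaries standing in for the real R output files
cat > chunk1.txt <<'EOF'
a x 10
b y 5
EOF
cat > chunk2.txt <<'EOF'
a x 7
c z 2
EOF

# Sum the partial "shared" totals keyed by (ID1, ID2)
cat chunk1.txt chunk2.txt |
  awk '{ s[$1 OFS $2] += $3 } END { for (k in s) print k, s[k] }' |
  sort > combined.txt

cat combined.txt
```

Here the pair (a, x) appeared in both chunks, so its partial sums 10 and 7 collapse to a single row with 17.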

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
