[R] Can R handle a matrix with 8 billion entries?

Wed Aug 10 07:28:13 CEST 2011

Sorry if this is a duplicate... my email is giving me trouble this evening...

On Tue, Aug 9, 2011 at 8:38 PM, Chris Howden
<chris at trickysolutions.com.au> wrote:
> Hi,
>
> I’m trying to do a hierarchical cluster analysis in R with a Big Data set.
> I’m running into problems using the dist() function.
>
> I’ve been looking at a few threads about R’s memory and have read the
> memory limits section in R help. However I’m no computer expert so I’m
> hoping I’ve misunderstood something and R can handle my Big Data set,
> somehow. Although at the moment I think my dataset is simply too big and
> there is no way around it, but I’d like to be proved wrong!
>
> My data set has 90523 rows of data and 24 columns.
>
> My understanding is that this means the distance matrix has a min of
> 90523^2 elements which is 8194413529. Which roughly translates as 8GB of
> memory being required (if I assume each entry requires 1 byte). I only have
> 4GB on a 32-bit build of Windows and R. So there is no way that’s going to
> work.
>
> So then I thought of getting access to a more powerful computer, and maybe
> using cloud computing.
>
> However the R memory limit help mentions “On all builds of R, the maximum
> length (number of elements) of a vector is 2^31 - 1 ~ 2*10^9”. Now as the
> distance matrix I require has more elements than this does this mean it’s
> too big for R no matter what I do?

You have understood correctly.
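
To put rough numbers on it, here is a small sketch of the arithmetic. It
assumes the usual layout of a "dist" object, which keeps only the
n*(n-1)/2 entries below the diagonal, each stored as an 8-byte double;
the variable names are just for illustration.

```r
n <- 90523                      # rows in the data set

# Entries in a full n x n matrix versus what dist() actually stores
full_entries <- n^2
dist_entries <- n * (n - 1) / 2

# Memory for the dist object at 8 bytes per double, in GiB
dist_gib <- dist_entries * 8 / 2^30

# The vector length limit quoted from the R memory help page
max_len <- 2^31 - 1
exceeds_limit <- dist_entries > max_len

print(dist_entries)   # 4097161503
print(dist_gib)       # roughly 30.5
print(exceeds_limit)  # TRUE
```

So even the half matrix is about twice the maximum vector length, and at
8 bytes per entry it needs far more than 8GB.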

>
> Any ideas would be welcome.

You have a couple of options, some more involved than others. If you want
to stick with R, I would suggest using a two-step clustering approach
in which you first use k-means (assuming your distance is Euclidean)
or a modification (for example, for correlation-based distances, the
package WGCNA contains a function called projectiveKMeans) to
pre-cluster your 90k+ variables into "blocks" of about 8-10k each
(that's about as much as your computer will handle). The k-means
algorithm only requires memory storage of order n*k where k is the
number of clusters (or blocks) which can be small, say 500, and n is
the number of your variables. Then you do hierarchical clustering in
each block separately. Make sure you install and load the package
flashClust or fastcluster to make the hierarchical clustering run
reasonably fast (the stock R implementation of hclust is horribly slow
with large data sets).
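
For what it's worth, the two-step idea can be sketched in base R roughly
as follows. This is only an illustration under assumptions: simulated
data stand in for the real table, the distance is Euclidean, and
n_blocks and the "average" linkage are arbitrary placeholders rather
than recommendations.

```r
# Simulated stand-in for the real data (the real set is 90523 x 24)
set.seed(1)
x <- matrix(rnorm(2000 * 24), nrow = 2000, ncol = 24)

# Step 1: k-means pre-clustering of the rows into blocks.
# n_blocks is hypothetical; choose it so that each block stays small
# enough for dist() to fit in memory.
n_blocks <- 4
km <- kmeans(x, centers = n_blocks, iter.max = 50, nstart = 5)
blocks <- split(seq_len(nrow(x)), km$cluster)

# Step 2: hierarchical clustering within each block separately.
# Only one block-sized distance matrix exists at any time.
trees <- lapply(blocks, function(idx) {
  d <- dist(x[idx, , drop = FALSE])   # Euclidean by default
  hclust(d, method = "average")
})

# Block sizes, to check that none is too large
print(lengths(blocks))
```

With one of the faster hclust replacements loaded, the hclust() call
inside the loop is the place to swap it in. Note that k-means can return
unbalanced blocks, so it is worth checking the block sizes before
computing any distances.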

The mentioned WGCNA package contains a function called
blockwiseModules that does just such a procedure, but there the
distance is based on correlations which may or may not suit your
problem.

HTH,

Peter