[R] Memory limits for MDSplot in randomForest package

Fri Mar 30 19:04:56 CEST 2012

Sam,

As you've probably seen, all the MDSplot() function does is feed 1 - proximity to the cmdscale() function.  Some suggestions and clarifications:

1. If all you want is the proximity matrix, you can run randomForest() with keep.forest=FALSE to save memory.  You will likely want to grow a fairly large number of trees if you're interested in proximity, and with that many data points the trees are going to be quite large as well.

2. The proximity matrix is n x n, so with about 19000 data points it is 19000 by 19000 (not 133000 by 133000).  Stored as 8-byte doubles, a single copy takes about 2.9GB (2.7GiB) of memory.

3. I made up a 19000 x 19000 cross-product matrix xx and ran cmdscale(1-xx, k=5).  Memory usage seemed to peak at around 16.3GB, and I killed the job after more than two hours.  So I suspect it really is the eigendecomposition in cmdscale() on such a large matrix that is taking up the time.
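
A rough sketch of points 1-3 on a small scale (not code from the original exchange; iris is only a stand-in data set, and ntree=500 is an arbitrary choice). It shows the memory arithmetic for a dense n x n matrix, a forest grown with keep.forest=FALSE, and the same 1 - proximity matrix passed straight to cmdscale():

```r
library(randomForest)

## Memory for one dense n x n matrix of doubles (8 bytes per entry)
n <- 19000
bytes <- n^2 * 8
bytes / 1e9     # about 2.9 (decimal GB)
bytes / 2^30    # about 2.7 (GiB)

## Stand-in data: iris, used here only for illustration
set.seed(17)
rf <- randomForest(x = iris[, 1:4], y = iris$Species,
                   ntree = 500,          # arbitrary choice for this sketch
                   proximity = TRUE,     # keep the n x n proximity matrix
                   keep.forest = FALSE)  # do not store the trees themselves

is.null(rf$forest)   # checks that no trees were stored
dim(rf$proximity)    # one row and one column per observation

## The scaling step: classical MDS on 1 - proximity
mds <- cmdscale(1 - rf$proximity, k = 2, eig = TRUE)
head(mds$points)     # coordinates that could be handed to plot()
```

Calling cmdscale() directly like this also makes it easy to time the scaling step separately from the forest itself, e.g. with system.time(), on increasingly large subsets.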

My suggestion is to look for a more efficient way of doing the eigendecomposition on such a large matrix.  You might be able to make the proximity matrix sparse (e.g., by thresholding small values to zero), and then see whether there are packages that can do the decomposition on the sparse form.
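
One way the thresholding idea might be sketched, under several assumptions that are not from the original exchange: the proximity matrix P below is synthetic, the threshold tau is made up, and a plain subspace iteration stands in for a dedicated sparse eigensolver package. The sketch relies on the fact that classical MDS on D = 1 - P decomposes B = -1/2 J (D*D) J (elementwise square, J the centering matrix), and that this equals 1/2 J S J with S = 2P - P*P, which is sparse whenever P is. Only the k leading eigenpairs are computed, and B is never formed densely. It assumes the largest-magnitude eigenvalues of B are positive, and thresholding makes the result an approximation of what MDSplot() would show.

```r
library(Matrix)  # recommended package shipped with R; provides sparse matrices

set.seed(1)

## Synthetic stand-in for rf$proximity: symmetric, unit diagonal, three groups
n <- 300
g <- rep(1:3, each = n / 3)
P <- outer(g, g, "==") * 0.8 + matrix(runif(n * n, 0, 0.05), n)
P <- (P + t(P)) / 2
diag(P) <- 1

## Threshold small proximities to exact zeros and store the result sparsely
tau  <- 0.1                                   # hypothetical cut-off
P_sp <- Matrix(P * (P >= tau), sparse = TRUE)

## (1 - P)^2 = 1 - S elementwise, and the constant part vanishes on centering
S <- 2 * P_sp - P_sp * P_sp

center <- function(V) sweep(V, 2, colMeans(V))          # applies J to each column
matvec <- function(V) 0.5 * center(as.matrix(S %*% center(V)))  # computes B %*% V

## Subspace iteration followed by a Rayleigh-Ritz step for the top k eigenpairs
top_eigen <- function(matvec, n, k, extra = 2, iters = 50) {
  Q <- qr.Q(qr(matrix(rnorm(n * (k + extra)), n)))
  for (i in seq_len(iters)) {
    Q <- qr.Q(qr(matvec(Q)))                  # multiply, then re-orthonormalise
  }
  H <- crossprod(Q, matvec(Q))                # small (k + extra)-square matrix
  e <- eigen((H + t(H)) / 2, symmetric = TRUE)
  vec <- (Q %*% e$vectors)[, 1:k, drop = FALSE]
  val <- e$values[1:k]
  list(values = val, vectors = vec,
       points = vec * rep(sqrt(val), each = n))  # MDS coordinates
}

fit <- top_eigen(matvec, n, k = 2)
fit$values
head(fit$points)
```

On a small case like this, the coordinates can be compared with cmdscale(1 - as.matrix(P_sp), k = 2), up to sign or rotation of the axes; each iteration costs only a sparse matrix product, so the work grows with the number of nonzero proximities rather than with n^2.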

Best,
Andy

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Sam Albers
> Sent: Friday, March 23, 2012 3:31 PM
> To: r-help at r-project.org
> Subject: [R] Memory limits for MDSplot in randomForest package
> 
> Hello,
> 
> I am struggling to produce an MDS plot using the randomForest package
> with a moderately large data set. My data set has one categorical
> response variable, 7 predictor variables and just under 19000
> observations. That means my proximity matrix is approximately 133000
> by 133000 which is quite large. To train a random forest on this large
> a dataset I have to use my institution's high performance computer.
> Using this setup I was able to train a randomForest with the proximity
> argument set to TRUE. At this point I wanted to construct an MDSplot
> using the following:
> 
> MDSplot(nech.rf, nech.d$pd.fl, palette=c(1,2,3), 
> pch=as.numeric(nech.d$pd.fl))
> 
> where "nech.rf" is the randomForest object and "nech.d$pd.fl" is the
> classification factor. Now with the architecture listed below, I've
> been waiting for approximately 2 days for this to run. My issue is
> that I am not sure if this will ever run.
> 
> Can anyone recommend a way to tweak the MDSplot function to run a
> little faster? I tried changing the cmdscale arguments (i.e.
> eigenvalues) within the MDSplot function a little but that didn't seem
> to have any effect on the overall running time using a much smaller
> data set. Or even if someone could comment whether I am dreaming that
> this will actually ever run?
> 
> This is probably the best computer that I will have access to so I was
> hoping that somehow I could get this to run. I was just hoping that
> someone reading the list might have some experience with randomForests
> and using large datasets and might be able to comment on my situation.
> Below the architecture information I have constructed a dummy example
> to illustrate what I am doing but given the nature of the problem,
> this doesn't completely reflect my situation.
> 
> Any help would be much appreciated!
> 
> Thanks!
> 
> Sam
> 
> ----
> 
> Computer specs and sessionInfo()
> 
> OS: Suse Linux
> Memory: 64 GB
> Processors: Intel Itanium 2, 64 x 1500 MHz
> 
> And:
> 
> > sessionInfo()
> R version 2.6.2 (2008-02-08)
> ia64-unknown-linux-gnu
> 
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] randomForest_4.6-6
> 
> loaded via a namespace (and not attached):
> [1] rcompgen_0.1-17
> 
> 
> ###
> # Dummy Example
> ###
> 
> require(randomForest)
> set.seed(17)
> 
> ## Number of points
> x <- 10
> 
> df <- rbind(
> data.frame(var1=runif(x, 10, 50),
>            var2=runif(x, 2, 7),
>            var3=runif(x, 0.2, 0.35),
>            var4=runif(x, 1, 2),
>            var5=runif(x, 5, 8),
>            var6=runif(x, 1, 2),
>            var7=runif(x, 5, 8),
>            cls=factor("CLASS-2")
>            )
>   ,
> data.frame(var1=runif(x, 10, 50),
>            var2=runif(x, -3, 3),
>            var3=runif(x, 0.1, 0.25),
>            var4=runif(x, 1, 2),
>            var5=runif(x, 5, 8),
>            var6=runif(x, 1, 2),
>            var7=runif(x, 5, 8),
>            cls=factor("CLASS-1")
>            )
> 
> )
> 
> 
> df.rf<-randomForest(y=df[,8],x=df[,1:7], proximity=TRUE, 
> importance=TRUE)
> 
> MDSplot(df.rf, df$cls, k=2, palette=c(1,2,3,4), 
> pch=as.numeric(df$cls))
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 