[R] Large dataset + randomForest

Kuhn, Max Max.Kuhn at pfizer.com
Thu Jul 26 20:26:11 CEST 2007


The first thing that you should change is how you call randomForest.
Instead of specifying the model via a formula, use the randomForest(x,
y) interface.
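
As a minimal sketch of that change, the call below passes the predictors and the response separately instead of a formula. The data frame here is a small hypothetical stand-in (the names df, V1 and trainindices are borrowed from the original message, the values are random), and ntree is lowered only to keep the example quick:

```r
library(randomForest)

set.seed(1)
# Hypothetical stand-in for the poster's data: response in V1, 5 random predictors
df <- data.frame(V1 = rnorm(200), matrix(rnorm(200 * 5), nrow = 200))
trainindices <- sample(nrow(df), 150)

# Split predictors and response up front instead of using V1 ~ .
x <- df[trainindices, names(df) != "V1"]
y <- df$V1[trainindices]

# Default (x, y) interface: no formula, so no terms object is built
rf <- randomForest(x, y, ntree = 50, do.trace = 10)
```

A forest fitted this way carries no terms component, so predict() must later be given new data with the same predictor columns as x.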

When a formula is used, a terms object is created so that a model
matrix can be built for these and future observations. That terms
object can get big (I think it would hold a matrix of size 151 x 150)
and that matrix is mostly zeros apart from a diagonal.
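
The shape of that matrix can be checked with base R alone. This is only an illustration on a hypothetical data frame with one response and 150 predictors (a few rows of zeros are enough, since terms() only needs the column names to expand the dot):

```r
# Hypothetical data frame shaped like the one in the question:
# response V1 plus 150 predictors V2..V151
p <- 150
df <- as.data.frame(matrix(0, nrow = 10, ncol = p + 1))

# Expanding "V1 ~ ." builds the terms object that the formula interface keeps
tt <- terms(V1 ~ ., data = df)

# The "factors" attribute has one row per variable and one column per term
fac <- attr(tt, "factors")

dim(fac)        # number of variables by number of terms
sum(fac != 0)   # each predictor appears in exactly one term
print(object.size(tt), units = "Kb")
```

The matrix has (p + 1) x p entries, so it grows quadratically with the number of predictors even though only p of the entries are non-zero.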

That might not solve it, but it should help.


-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Florian Nigsch
Sent: Thursday, July 26, 2007 2:07 PM
To: r-help at stat.math.ethz.ch
Subject: [R] Large dataset + randomForest

[Please CC me in any replies as I am not currently subscribed to the  
list. Thanks!]

Dear all,

I did a bit of searching on the question of large datasets but did  
not come to a definite conclusion. What I am trying to do is the  
following: I want to read in a dataset with approx. 100 000 rows and  
approx. 150 columns. The file size is ~ 33MB, which one would deem not  
too big a file for R. To speed up the reading in of the file I do not  
use read.table but a loop that does reading with scan() into a buffer  
and some preprocessing and then adds the data into a dataframe.

When I then want to run randomForest() R complains that I cannot  
allocate a vector of size 313.0 MB. I am aware that randomForest  
needs all data in memory, but
1) why should that suddenly be 10 times the size of the data (I  
acknowledge the need for some internal data of R, but 10 times seems a  
bit too much) and
2) there is still physical memory free on the machine (in total 4GB  
available, even though R is limited to 2GB if I correctly remember  
the help pages - still 2GB should be enough!) - it doesn't seem to  
work either with changed settings done via mem.limits(), or run-time  
arguments --min-vsize --max-vsize - what do these have to be set to  
to work in my case??

 > rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
Error: cannot allocate vector of size 313.0 Mb
 > object.size(df)/1024/1024
[1] 129.5390

Any help would be greatly appreciated,


Florian Nigsch <fn211 at cam.ac.uk>
Unilever Centre for Molecular Sciences Informatics
Department of Chemistry
University of Cambridge
Telephone: +44 (0)1223 763 073
