[R] Large dataset + randomForest

Florian Nigsch fn211 at cam.ac.uk
Fri Jul 27 17:03:43 CEST 2007

Thanks Max,

It looks as though that did actually solve the problem! It is a bit
of a mystery to me, because I think the 151x150 matrix can't be
that big unless its elements are in turn huge data structures. (?)
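For what it's worth, a quick back-of-the-envelope check (a sketch; the exact figure depends on the object's attributes) suggests the matrix itself is tiny:

```r
# A 151 x 150 numeric matrix holds 151 * 150 = 22650 doubles
# at 8 bytes each, i.e. roughly 0.17 MB plus a small header.
m <- matrix(0, nrow = 151, ncol = 150)
print(object.size(m))   # on the order of 180 KB
151 * 150 * 8           # 181200 bytes
```

So presumably it was not the terms matrix itself that ate the memory, but the model matrix built from it.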

I am now calling randomForest() like this:
rf <- randomForest(x = df[trainindices, -1], y = df[trainindices, 1],
                   xtest = df[testindices, -1], ytest = df[testindices, 1],
                   do.trace = 5, ntree = 500)
and it seems to be working just fine.

Thanks to all for your help,


On 26 Jul 2007, at 19:26, Kuhn, Max wrote:

> Florian,
> The first thing that you should change is how you call randomForest.
> Instead of specifying the model via a formula, use the randomForest(x,
> y) interface.
> When a formula is used, a terms object is created so that a model
> matrix can be built for these and future observations. That terms
> object can get big (I think it would be a matrix of size 151 x 150)
> and is diagonal.
> That might not solve it, but it should help.
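> A minimal sketch of the two calling styles (illustrative only; `df`
> here is a stand-in data frame with the response in column 1):
>
> ```r
> library(randomForest)
> df <- data.frame(y = rnorm(100), matrix(rnorm(100 * 5), nrow = 100))
>
> # Formula interface: builds a terms object and a model matrix
> # behind the scenes
> rf1 <- randomForest(y ~ ., data = df)
>
> # x/y interface: passes predictors and response directly,
> # avoiding the formula machinery and its extra copies
> rf2 <- randomForest(x = df[, -1], y = df[, 1])
> ```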
> Max
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Florian Nigsch
> Sent: Thursday, July 26, 2007 2:07 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Large dataset + randomForest
> [Please CC me in any replies as I am not currently subscribed to the
> list. Thanks!]
> Dear all,
> I did a bit of searching on the question of large datasets but did
> not come to a definite conclusion. What I am trying to do is the
> following: I want to read in a dataset with approx. 100 000 rows and
> approx. 150 columns. The file size is ~33 MB, which one would not
> deem too big a file for R. To speed up reading the file I do not use
> read.table but a loop that reads with scan() into a buffer, does
> some preprocessing, and then appends the data to a data frame.
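> A rough sketch of that kind of scan()-based reader (hypothetical
> file name and column count; assumes all columns are numeric and
> whitespace-separated):
>
> ```r
> ncols <- 150                   # assumed number of columns
> # read the whole file as doubles in one pass
> raw <- scan("data.txt", what = double())
> df  <- as.data.frame(matrix(raw, ncol = ncols, byrow = TRUE))
> ```
>
> Specifying colClasses in read.table() achieves a similar speed-up
> without the manual loop.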
> When I then want to run randomForest(), R complains that it cannot
> allocate a vector of size 313.0 MB. I am aware that randomForest
> needs all the data in memory, but
> 1) why should that suddenly be 10 times the size of the data (I
> acknowledge that R needs some internal data, but 10 times seems a
> bit too much), and
> 2) there is still physical memory free on the machine (4 GB in
> total, even though R is limited to 2 GB if I remember the help pages
> correctly - still, 2 GB should be enough!). It doesn't seem to work
> either with settings changed via mem.limits() or with the run-time
> arguments --min-vsize and --max-vsize - what do these have to be
> set to in my case?
>> rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
> Error: cannot allocate vector of size 313.0 Mb
>> object.size(df)/1024/1024
> [1] 129.5390
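> A sketch of how one can watch the peak allocation while this runs
> (gc() reports both current and maximum memory used by R):
>
> ```r
> gc(reset = TRUE)   # reset the "max used" counters
> rf <- randomForest(V1 ~ ., data = df[trainindices, ])
> gc()               # the "max used" columns show the peak footprint
> ```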
> Any help would be greatly appreciated,
> Florian
> --
> Florian Nigsch <fn211 at cam.ac.uk>
> Unilever Centre for Molecular Sciences Informatics
> Department of Chemistry
> University of Cambridge
> http://www-mitchell.ch.cam.ac.uk/
> Telephone: +44 (0)1223 763 073
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
