[R] randomForest memory footprint

John Foreman john.4man at gmail.com
Thu Sep 8 15:24:41 CEST 2011


I set maxnodes and nodesize to reasonable levels and everything is
working great now. Thanks for the guidance.
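
A minimal sketch of what such a call can look like, using synthetic data in place of the record-linkage set. The values for nodesize and maxnodes are illustrative assumptions, not the settings used in this thread, and the variable names are made up for the example. treesize() is called only to check that the cap on terminal nodes took effect.

```r
library(randomForest)

set.seed(1)

# Synthetic stand-in: 7 numeric predictors and a numeric response,
# so randomForest() fits a regression forest.
n <- 5000
x <- data.frame(matrix(runif(n * 7), ncol = 7))
y <- x[[1]] + rnorm(n, sd = 0.1)

# nodesize raises the minimum terminal-node size (regression default is 5);
# maxnodes caps the number of terminal nodes per tree.
# Both values here are arbitrary choices for illustration.
fit <- randomForest(x = x, y = y, ntree = 50,
                    nodesize = 50, maxnodes = 64)

# Largest number of terminal nodes found in any tree of the forest.
max_leaves <- max(treesize(fit))
print(max_leaves)
```

Larger nodesize or smaller maxnodes means shallower trees, which trades some fit for a much smaller forest object.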

v/r,
John

On Thu, Sep 8, 2011 at 8:58 AM, Liaw, Andy <andy_liaw at merck.com> wrote:
> It looks like you are building a regression model.  With such a large number of rows, you should try to limit the size of the trees by setting nodesize to something larger than the default (5).  The issue, I suspect, is that the largest possible tree has about 2*n/nodesize nodes (n being the number of rows), and each node takes a row in a matrix to store.  Multiply that by the number of trees you are trying to build, and you see how the memory can be gobbled up quickly.  Boosted trees don't usually run into this problem because one usually boosts very small trees (usually no more than 10 terminal nodes per tree).
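
The arithmetic behind this can be sketched as a rough upper bound. The function name, the count of six stored values per node, and the 8 bytes per value are all assumptions made for illustration; they are not taken from the randomForest source, and the result is an order-of-magnitude guide only.

```r
# Back-of-envelope bound on forest storage. Every constant is an assumption.
estimate_forest_mb <- function(n_rows, nodesize, ntree,
                               fields_per_node = 6, bytes_per_field = 8) {
  # Taking nodesize as the minimum terminal-node size, a binary tree has
  # at most n_rows / nodesize leaves, hence roughly twice that many nodes.
  max_nodes <- 2 * floor(n_rows / nodesize) + 1
  # One matrix row per node, per tree.
  bytes <- max_nodes * fields_per_node * bytes_per_field * ntree
  bytes / 2^20
}

small_leaves <- estimate_forest_mb(500000, nodesize = 5,   ntree = 100)
large_leaves <- estimate_forest_mb(500000, nodesize = 500, ntree = 100)
print(c(nodesize_5 = small_leaves, nodesize_500 = large_leaves))
```

Under these assumptions the bound shrinks in proportion to nodesize, which is why raising it is the first thing to try.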
>
> Best,
> Andy
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of John Foreman
>> Sent: Wednesday, September 07, 2011 2:46 PM
>> To: r-help at r-project.org
>> Subject: [R] randomForest memory footprint
>>
>> Hello, I am attempting to train a random forest model using the
>> randomForest package on 500,000 rows and 8 columns (7 predictors, 1
>> response). The data set is the first block of data from the UCI
>> Machine Learning Repo dataset "Record Linkage Comparison Patterns"
>> with the slight modification that I dropped two columns with lots of
>> NA's and I used knn imputation to fill in other gaps.
>>
>> When I load my dataset, R uses no more than 100 megs of RAM. I'm
>> running 64-bit R with ~4 gigs of RAM available. When I execute the
>> randomForest() function, however, I get memory complaints. Example:
>>
>> > summary(mydata1.clean[,3:10])
>>   cmp_fname_c1     cmp_lname_c1       cmp_sex           cmp_bd           cmp_bm           cmp_by          cmp_plz         is_match
>>  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   FALSE:572820
>>  1st Qu.:0.2857   1st Qu.:0.1000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   TRUE :  2093
>>  Median :1.0000   Median :0.1818   Median :1.0000   Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000
>>  Mean   :0.7127   Mean   :0.3156   Mean   :0.9551   Mean   :0.2247   Mean   :0.4886   Mean   :0.2226   Mean   :0.00549
>>  3rd Qu.:1.0000   3rd Qu.:0.4286   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000
>>  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
>> > mydata1.rf.model2 <- randomForest(x = mydata1.clean[,3:9], y = mydata1.clean[,10], ntree = 100)
>> Error: cannot allocate vector of size 877.2 Mb
>> In addition: Warning messages:
>> 1: In dim(data) <- dim :
>>   Reached total allocation of 3992Mb: see help(memory.size)
>> 2: In dim(data) <- dim :
>>   Reached total allocation of 3992Mb: see help(memory.size)
>> 3: In dim(data) <- dim :
>>   Reached total allocation of 3992Mb: see help(memory.size)
>> 4: In dim(data) <- dim :
>>   Reached total allocation of 3992Mb: see help(memory.size)
>>
>> Other techniques such as boosted trees handle the data size just fine.
>> Are there any parameters I can adjust such that I can use a value of
>> 100 or more for ntree?
>>
>> Thanks,
>> John
>>


