[R] randomForest speed improvements

Jonathan P Daily jdaily at usgs.gov
Mon Jan 3 21:10:55 CET 2011


Have you tried adjusting:
mtry - the number of variables randomly sampled as candidates at each split
ntree - the number of trees grown
keep.forest - logical; whether to keep the fitted forest in the returned object
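
As a rough sketch of how these arguments are passed to randomForest(): the data below are synthetic stand-ins (six numeric predictors, one numeric target), and the values chosen for ntree and mtry are illustrative assumptions, not tuned recommendations.

```r
library(randomForest)

set.seed(1)

# Synthetic stand-in data: 6 numeric predictors and a numeric target
n <- 2000
x <- as.data.frame(matrix(runif(n * 6), ncol = 6))
y <- 10 * x[[1]] + 5 * x[[2]] + rnorm(n)

# Package defaults for regression: ntree = 500, mtry = max(floor(ncol(x) / 3), 1)
t_default <- system.time(rf_default <- randomForest(x, y))

# Fewer trees and an explicitly chosen mtry
t_reduced <- system.time(
  rf_reduced <- randomForest(x, y, ntree = 100, mtry = 2)
)

# Elapsed wall-clock seconds for each fit
print(c(default = t_default[["elapsed"]], reduced = t_reduced[["elapsed"]]))
```

Timing a few settings like this on a subset of the real data is a cheap way to see which argument matters most before committing to a full run.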

Specifically, switching keep.forest to FALSE gave me large speed
improvements in the past, in cases where I didn't actually need the forest
after the analysis.
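
One caveat: with keep.forest=FALSE the returned object has no forest, so it cannot be passed to predict() on new data afterwards. If predictions for a known hold-out set are all that is needed, randomForest() can score that set during fitting through its xtest argument. A minimal sketch, again on made-up data with arbitrary sizes:

```r
library(randomForest)

set.seed(1)

# Synthetic stand-in data: 6 numeric predictors and a numeric target
n <- 1200
dat <- as.data.frame(matrix(runif(n * 6), ncol = 6))
target <- 10 * dat[[1]] + 5 * dat[[2]] + rnorm(n)

train <- 1:1000
test <- 1001:1200

# Score the hold-out rows while fitting, without storing the trees
rf <- randomForest(
  x = dat[train, ], y = target[train],
  xtest = dat[test, ],
  ntree = 100,
  keep.forest = FALSE
)

# Hold-out predictions are returned in the test component
p <- rf$test$predicted
print(length(p))

# No forest was kept, so predict(rf, newdata) would fail here
print(is.null(rf$forest))
```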
--------------------------------------
Jonathan P. Daily
Technician - USGS Leetown Science Center
11649 Leetown Road
Kearneysville WV, 25430
(304) 724-4480
"Is the room still a room when it's empty? Does the room,
 the thing itself have purpose? Or do we, what's the word... imbue it."
     - Jubal Early, Firefly

r-help-bounces at r-project.org wrote on 01/03/2011 02:59:29 PM:

> [R] randomForest speed improvements
> From: apresley
> To: r-help
> Date: 01/03/2011 03:03 PM
> Sent by: r-help-bounces at r-project.org
>
> Hi there,
> 
> We're trying to use randomForest to do some predictions.  The test harness
> for our code is pretty straightforward:
> 
>   library(randomForest)
>   data202 <- read.csv("random.csv", header = TRUE)
>   x <- data202[1:50000, 1:6]
>   y <- data202[1:50000, 8]
>   y <- y[, drop = TRUE]
>
>   x2 <- data202[50001:60000, 1:6]
>   y2 <- data202[50001:60000, 8]
>   y2 <- y2[, drop = TRUE]
>
>   RFobject <- randomForest(x, y, na.action = na.roughfix)
>   p <- predict(RFobject, x2)
> 
> In this case, the CSV contains 10 columns, of which 1-6 are numeric in
> nature (day of week, week of month, etc.) and column 8 is the target
> (sales, a numeric value).
> 
> randomForest does fine with the data; our issue is how long it takes. In
> this case, about 5,000 rows of data seems to take just a few seconds, but
> going to 50,000 rows doesn't take 10x the time, it takes perhaps 30 or 40
> minutes.
> 
> We've downloaded and tried RT-Rank, which is a multi-threaded version of
> RandomForest, and this seems to produce the same (or slightly better)
> predictions, but also gets done fairly quickly.
> 
> What can we do to improve the speed of this data computation?  The system
> we're on is a dual quad-core Intel CPU @ 2.33GHz, and with 16GB of RAM ...
> we're using the "stock" R RPM for CentOS 5.5.
> 
> Thanks!
> 
> --
> Anthony