[R] Running randomForests on large datasets

Max Kuhn mxkuhn at gmail.com
Wed Feb 27 21:23:32 CET 2008


Also, use the non-formula interface to the function:

  # saves some space
  randomForest(x, y)

rather than the formula interface:

  # avoid:
  randomForest(y ~ ., data = something)

The second call stores a terms object along with the fit; that object is
mostly zeros but kept in dense form, so it takes up a lot of space.
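
A minimal sketch of the difference on simulated data (sizes will vary
with the dimensions of the problem):

  # illustrative only: compare the footprint of the two interfaces
  library(randomForest)
  set.seed(1)
  x <- as.data.frame(matrix(rnorm(1000 * 200), ncol = 200))
  y <- rnorm(1000)
  dat <- data.frame(y = y, x)

  fit1 <- randomForest(x, y, ntree = 10)               # non-formula
  fit2 <- randomForest(y ~ ., data = dat, ntree = 10)  # formula
  print(object.size(fit1), units = "Mb")
  print(object.size(fit2), units = "Mb")  # larger: carries the terms object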

Max

On Wed, Feb 27, 2008 at 12:31 PM, Nagu <thogiti at gmail.com> wrote:
> Thank you Andy.
>
>  It throws a memory allocation error for me for numerous
>  combinations of ntree and nodesize values. I tried memory.limit()
>  and memory.size() to make the maximum memory available, but the
>  error was consistent. One thing I noticed was that I previously had
>  a tough time even just loading the dataset. I then used the Rcmdr
>  library to load the same data; it was faster than loading from the
>  R console, and it didn't throw the memory errors it used to throw
>  now and then. I thought this might be a fluke with Rcmdr, so I
>  opened it a few more times, and every time Rcmdr loaded the large
>  dataset without any allocation errors. I also tried opening a few
>  other programs on the desktop and repeating the process; it still
>  loaded just fine.
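
The memory functions mentioned here are Windows-only; they report and
raise R's allocation ceiling in Mb. A minimal sketch:

  # Windows-only: inspect and raise R's memory cap (sizes in Mb)
  memory.size()              # Mb currently in use
  memory.size(max = TRUE)    # maximum Mb obtained from the OS so far
  memory.limit()             # current ceiling
  memory.limit(size = 3000)  # request a higher ceiling, if the OS allows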
>
>  Any ideas on how Rcmdr loads the file, as opposed to the R console
>  (I am using read.table())?
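
Whatever Rcmdr does internally, a common way to cut read.table()'s
memory use is to declare the column types and row count up front rather
than letting R guess. A sketch with a hypothetical file and layout:

  # "big.txt" and the all-numeric layout are made up for illustration
  dat <- read.table("big.txt", header = TRUE,
                    colClasses = rep("numeric", 650),  # skip type guessing
                    nrows = 500000,                    # preallocate rows
                    comment.char = "")                 # faster scanning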
>
>  Anyway, I thought I'd share this observation with the others. Thank
>  you, Andy, for your ideas. I'll keep tinkering with the parameters.
>
>  Thank you,
>  Nagu
>
>
>
>  On Wed, Feb 27, 2008 at 5:24 AM, Liaw, Andy <andy_liaw at merck.com> wrote:
>  > There are a couple of things you may want to try, if you can load the
>  >  data into R and still have enough to spare:
>  >
>  >  - Run randomForest() with fewer trees, say 10 to start with.
>  >
>  >  - Run randomForest() with nodesize set to something larger than the
>  >  default (5 for classification).  This puts a limit on the size of the
>  >  trees being grown.  Try something like 21 and see if that runs, and
>  >  adjust accordingly.
>  >
>  >  HTH,
>  >  Andy
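
A minimal sketch of those two suggestions (ntree and nodesize are real
randomForest arguments; the package's combine() can also merge forests
grown in batches, in case one full run does not fit):

  # few trees plus larger terminal nodes keep each tree small
  fit1 <- randomForest(x, y, ntree = 10, nodesize = 21)
  fit2 <- randomForest(x, y, ntree = 10, nodesize = 21)
  fit  <- combine(fit1, fit2)  # one 20-tree forest, grown in pieces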
>  >
>  >
>  >  From: Nagu
>  >
>  >
>  >
>  >  > Hi,
>  >  >
>  >  > I am trying to run randomForest on a dataset of size 500000 x 650,
>  >  > and R pops up a memory allocation error. Are there any better ways
>  >  > to deal with large datasets in R? For example, S-PLUS had something
>  >  > like the bigdata library.
>  >  >
>  >  > Thank you,
>  >  > Nagu
>  >  >
>  >


