[R] Can't seem to finish a randomForest.... Just goes and goes!
Liaw, Andy
andy_liaw at merck.com
Tue Apr 6 04:15:21 CEST 2004
If that variable is a subject ID, and the data are repeated observations on
the subjects, then you might be treading on thin ice here. A while back
someone at NCI got a data set with two reps per subject, and he was able to
modify the code so that the bootstrap is done on a per-subject basis, rather
than per observation. It's a bit of work trying to get a proximity matrix to
make sense, though.
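The per-subject bootstrap Andy mentions can be sketched in a few lines of base R. This is a hypothetical illustration, not the NCI modification itself; the `subject_bootstrap` function and the `subject` column name are invented here:

```r
## Hypothetical sketch: resample whole subjects, not individual rows,
## so that repeated observations of a subject stay together.
subject_bootstrap <- function(d, id = "subject") {
  ids <- unique(d[[id]])
  picked <- sample(ids, length(ids), replace = TRUE)
  ## bind every row of each sampled subject (subjects may repeat)
  do.call(rbind, lapply(picked, function(s) d[d[[id]] == s, , drop = FALSE]))
}

## e.g. two reps per subject, as in the data set Andy describes
d <- data.frame(subject = rep(1:5, each = 2), x = rnorm(10))
b <- subject_bootstrap(d)  # both reps of any sampled subject come in together
```

Each tree of a forest would then be grown on one such resample, and the out-of-bag set would likewise be defined by subjects rather than rows.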
I really have no idea how to take care of repeated-measures-type data (i.e.,
accounting for intra-subject correlations) in a classification problem. I
suppose one could formulate it as a GLMM. I guess it really depends on what
you are looking for; i.e., what's the goal? I assume you want to predict
something, but is that over all subjects, or subject-specific? I'd better
stop here, as this is out of my league...
Andy
> From: David L. Van Brunt, Ph.D. [mailto:dvanbrunt at well-wired.com]
>
> Removing that first 30-level variable, the trees ran just fine. I had
> also taken the shorter categoricals (day of week, for example) and
> read them in as numerics.
>
> Still working on it. Need that 30-level puppy in there somehow, but it
> really is not anything like a rank... It is a nominal variable.
>
> With numeric values, only assigning the outcome (last column) to be a
> factor using "as.factor()", it runs fine, and fast.
>
> I may be misusing this analysis. That first column is indeed nominal,
> and I was including it because the data within that name are repeated
> observations of that subject. But I suppose there's no guarantee that
> that information would be selected, so what does that do to the
> forest? Sigh. I'm not much of a lumberjack. Logistic regression is
> more my style, but this is pretty interesting stuff.
>
> If interested, here's a link to the data:
> http://www.well-wired.com/reflibrary/uploads/1081216314.txt
>
>
>
> On 4/5/04 1:40, "Bill.Venables at csiro.au"
> <Bill.Venables at csiro.au> wrote:
>
> > Alternatively, if you can arrive at a sensible ordering of the
> > levels, you can declare them ordered factors and make the
> > computation feasible once again.
> >
> > Bill Venables.
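The saving Bill describes is large: a k-level unordered factor forces a search over 2^(k-1) - 1 subsets, while an ordered factor needs only the k - 1 cut points along its ordering. Declaring the ordering is one call; in the sketch below the alphabetical ordering is just a placeholder, since finding a substantively sensible ordering is the hard part:

```r
## Illustration only: turn a many-level nominal variable into an ordered
## factor.  Alphabetical order stands in for a meaningful ordering.
x <- factor(rep(letters[1:26], length.out = 200))   # 26 levels for the demo
x_ord <- factor(x, levels = sort(levels(x)), ordered = TRUE)

## split search: k - 1 cut points instead of 2^(k-1) - 1 subsets
cuts    <- nlevels(x_ord) - 1           # 25
subsets <- 2^(nlevels(x_ord) - 1) - 1   # 33554431
```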
> >
> > -----Original Message-----
> > From: r-help-bounces at stat.math.ethz.ch
> > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Torsten Hothorn
> > Sent: Monday, 5 April 2004 4:27 PM
> > To: David L. Van Brunt, Ph.D.
> > Cc: R-Help
> > Subject: Re: [R] Can't seem to finish a randomForest.... Just goes
> > and goes!
> >
> >
> > On Sun, 4 Apr 2004, David L. Van Brunt, Ph.D. wrote:
> >
> >> Playing with randomForest, the samples run fine. But on real data,
> >> no go.
> >>
> >> Here's the setup: OS X, same behavior whether I'm using R-Aqua 1.8.1
> >> or the Fink compile-of-my-own with X11; R version 1.8.1 either way.
> >>
> >> This is on OS X 10.3 (aka "Panther"), an 800 MHz G4 with 512 MB of
> >> physical RAM.
> >>
> >> I have not altered the startup options of R.
> >>
> >> The data set is read in from a text file with "read.table", and has
> >> 46 variables and 1,855 cases. Trying the following:
> >>
> >> The DV is categorical, 0 or 1. Most of the IVs are either continuous
> >> or correctly read in as factors.
> >> The largest factor has 30 levels.... Only the
> >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > This means: there are 2^(30-1) = 536,870,912 possible splits to be
> > evaluated every time this variable is picked (minus something due to
> > empty levels). At least the last time I looked at the code,
> > randomForest used an exhaustive search over all possible splits. Try
> > reducing the number of levels to something reasonable (or for a first
> > shot: remove this variable from the learning sample).
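Torsten's arithmetic is easy to check in R. Counting only nontrivial binary partitions of k unordered levels gives 2^(k-1) - 1 (one less than his 2^29 figure, because the empty split is excluded), which is the same order of magnitude:

```r
## number of nontrivial binary partitions of k unordered factor levels
n_splits <- function(k) 2^(k - 1) - 1

n_splits(30)  # 536870911: hopeless for an exhaustive search
n_splits(10)  # 511: already sizable, but feasible
n_splits(2)   # 1
```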
> >
> > Best,
> >
> > Torsten
> >
> >
> >> DV seems to need identifying as a factor to force classification
> >> trees over regression:
> >>
> >>> Mydata$V46 <- as.factor(Mydata$V46)
> >>> Myforest.rf <- randomForest(V46 ~ ., data = Mydata, ntree = 100,
> >>> +   mtry = 7, proximity = FALSE, importance = FALSE)
> >>
> >> 5 hours later, R.bin was still taking up 75% of my processor. When
> >> I've tried this with larger data, I get errors referring to the
> >> buffer (sorry, not in front of me right now).
> >>
> >> Any ideas on this? The data don't seem horrifically large. Seems
> >> like there are a few options for setting memory size, but I'm not
> >> sure which of them to try tweaking, or if that's even the issue.
> >>
> >> ______________________________________________
> >> R-help at stat.math.ethz.ch mailing list
> >> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide!
> >> http://www.R-project.org/posting-guide.html
> >>
> >>
> >
>
> --
> David L. Van Brunt, Ph.D.
> Outlier Consulting & Development
> mailto: <ocd at well-wired.com>
>
>
>
>