[R] Can't seem to finish a randomForest.... Just goes and goes!

Tue Apr 6 04:38:13 CEST 2004

Fortunately, I'm not interested in any of the explanatory uses of model
building, just getting a useful prediction. I have a lot of "explaining" on
my day job. For this, I just want to be right, don't care about the why. I
have thought of one alternative approach, not sure of the implications. I
could run each "subject" separately, and predict each separately. Get a
forest for one group, to predict this organization's behavior given its long
history. Repeat for each other. Then just compare the probabilities of
reaching the desired state. In a sense, that's kind of the goal anyway...
But since none of these groups operate in a vacuum, seems like it would be
nice to get the bigger picture in there somehow. Not sure if this
alternative is really workable.
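
A rough sketch of the per-subject idea above, on synthetic data. The column
names (subject, x1, x2, y) are made up for illustration, the last row of each
subject stands in for "the case to predict", and the whole thing is skipped if
the randomForest package is not installed. It ignores any information shared
across subjects, which is exactly the limitation mentioned above.

```r
# Sketch only: synthetic data; `subject`, `x1`, `x2`, `y` are hypothetical names.
set.seed(1)
n_per <- 60
dat <- data.frame(
  subject = factor(rep(c("A", "B", "C"), each = n_per)),
  x1 = rnorm(3 * n_per),
  x2 = rnorm(3 * n_per)
)
dat$y <- factor(as.integer(dat$x1 + rnorm(3 * n_per) > 0), levels = c(0, 1))

prob_by_subject <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  prob_by_subject <- sapply(split(dat, dat$subject), function(d) {
    train   <- d[-nrow(d), ]   # this subject's history
    new_row <- d[nrow(d), ]    # held-out case to predict
    # One forest per subject; the ID column is constant here, so it is left out.
    fit <- randomForest::randomForest(y ~ x1 + x2, data = train, ntree = 100)
    # Estimated probability of the desired state (class "1").
    as.numeric(predict(fit, newdata = new_row, type = "prob")[, "1"])
  })
  # Compare subjects by their estimated probability.
  print(sort(prob_by_subject, decreasing = TRUE))
}
```
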

Thanks for all of your hard work, Andy (and for your input Bill and
Torsten). It really is a fascinating approach.

Looks like, to modify that bootstrapping, I'd have to go into the Fortran
code, then? Hmmmm. Talk about being out of one's league!
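
For what it's worth, the resampling idea itself (bootstrap subjects rather
than observations) can be illustrated in plain R without touching the Fortran.
This is only a sketch of the sampling step on made-up data; `subject` is a
hypothetical ID column and `subject_bootstrap` is a hypothetical helper. It is
not wired into randomForest.

```r
# Sketch only: `subject` is a hypothetical ID column with repeated rows per subject.
set.seed(42)
dat <- data.frame(
  subject = rep(paste0("S", 1:5), times = c(3, 2, 4, 2, 3)),
  x = rnorm(14)
)

# Draw subjects (not rows) with replacement, then take every row of each
# drawn subject. A subject drawn twice contributes its rows twice.
subject_bootstrap <- function(data, id) {
  ids   <- unique(as.character(data[[id]]))
  drawn <- sample(ids, length(ids), replace = TRUE)
  rows  <- unlist(lapply(drawn, function(s) which(data[[id]] == s)))
  list(
    in_bag     = data[rows, , drop = FALSE],
    out_of_bag = data[!(data[[id]] %in% drawn), , drop = FALSE]
  )
}

bs <- subject_bootstrap(dat, "subject")
# No subject is split between the in-bag and out-of-bag sets.
intersect(as.character(bs$in_bag$subject), as.character(bs$out_of_bag$subject))
```

Because whole subjects are either in or out of the bag, the out-of-bag rows
never share a subject with the rows a tree was grown on.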

On 4/5/04 21:15, "Liaw, Andy" <andy_liaw at merck.com> wrote:

> If that variable is a subject ID, and the data are repeated observations on
> the subjects, then you might be treading on thin ice here.  A while back
> someone at NCI got a data set with two reps per subject, and he was able to
> modify the code so that the bootstrap is done on the subject basis, rather
> than observations.  It's a bit of work trying to get a proximity matrix to
> make sense, though.
> 
> I really have no idea how to take care of repeated measures type data (i.e.,
> accounting for intra-subject correlations) in a classification problem.  I
> suppose one can formulate it as a GLMM.  I guess it really depends on what
> you are looking for; i.e., what's the goal?  I assume you want to predict
> something, but is that over all subjects, or subject-specific?  I better
> stop here, as this is out of my league...
> 
> Andy
> 
>> From: David L. Van Brunt, Ph.D. [mailto:dvanbrunt at well-wired.com]
>> 
>> Removing that first 30-level variable, the trees ran just fine. I had also
>> taken the shorter categoricals (day of week, for example) and read them in
>> as numerics.
>> 
>> Still working on it. Need that 30 level puppy in there somehow, but it
>> really is not anything like a rank... It is a nominal variable.
>> 
>> With numeric values, only assigning the outcome (last column) to be a factor
>> using "as.factor()" it runs fine, and fast.
>> 
>> I may be misusing this analysis. That first column is indeed nominal, and I
>> was including it because the data within that name are repeated observations
>> of that subject. But I suppose there's no guarantee that that information
>> would be selected, so what does that do to the forest?  Sigh. I'm not much
>> of a lumberjack. Logistic regression is more my style, but this is pretty
>> interesting stuff.
>> 
>> If interested, here's a link to the data;
>> http://www.well-wired.com/reflibrary/uploads/1081216314.txt
>> 
>>  
>> 
>> On 4/5/04 1:40, "Bill.Venables at csiro.au"
>> <Bill.Venables at csiro.au> wrote:
>> 
>>> Alternatively, if you can arrive at a sensible ordering of the levels
>>> you can declare them ordered factors and make the computation feasible
>>> once again.
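
A minimal sketch of declaring an ordered factor in base R. The data are
synthetic and the names `site` and `y` are made up. Ordering the levels by the
observed proportion of y == 1 is only one candidate ordering, assumed here for
illustration; it uses the outcome, so it would have to be computed on the
training data only.

```r
# Sketch only: `site` and `y` are hypothetical names on synthetic data.
set.seed(7)
dat <- data.frame(
  site = factor(sample(paste0("g", 1:30), 500, replace = TRUE)),
  y    = rbinom(500, 1, 0.4)
)

# One candidate ordering (an assumption, not a prescription):
# sort the levels by the observed proportion of y == 1 within each level.
rate <- tapply(dat$y, dat$site, mean)
dat$site_ord <- factor(dat$site, levels = names(sort(rate)), ordered = TRUE)

is.ordered(dat$site_ord)     # TRUE
nlevels(dat$site_ord) - 1    # number of cut points for an ordered factor
```
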
>>> 
>>> Bill Venables.
>>> 
>>> -----Original Message-----
>>> From: r-help-bounces at stat.math.ethz.ch
>>> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Torsten Hothorn
>>> Sent: Monday, 5 April 2004 4:27 PM
>>> To: David L. Van Brunt, Ph.D.
>>> Cc: R-Help
>>> Subject: Re: [R] Can't seem to finish a randomForest.... Just goes and goes!
>>> 
>>> 
>>> On Sun, 4 Apr 2004, David L. Van Brunt, Ph.D. wrote:
>>> 
>>>> Playing with randomForest, samples run fine. But on real data, no go.
>>>> 
>>>> Here's the setup: OS X, same behavior whether I'm using R-Aqua 1.8.1
>>>> or the Fink compile-of-my-own with X-11, R version 1.8.1.
>>>> 
>>>> This is on OS X 10.3 (aka "Panther"), G4 800 MHz with 512 MB physical
>>>> RAM.
>>>> 
>>>> I have not altered the Startup options of R.
>>>> 
>>>> Data set is read in from a text file with "read.table", and has 46
>>>> variables and 1,855 cases. Trying the following:
>>>> 
>>>> The DV is categorical, 0 or 1. Most of the IV's are either continuous,
>>>> or correctly read in as factors. The largest factor has 30 levels....
>>>> Only the
>>>                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> 
>>> This means: there are 2^(30-1) = 536,870,912 possible splits to be
>>> evaluated every time this variable is picked up (minus something due to
>>> empty levels). At least the last time I looked at the code, randomForest
>>> used an exhaustive search over all possible splits. Try reducing the
>>> number of levels to something reasonable (or for a first shot: remove
>>> this variable from the learning sample).
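
The count above is easy to check. The sketch below tabulates how the number of
candidate two-way splits grows with the number of levels; the function names
are made up for illustration. Strictly, dropping the trivial split that sends
every level to one side gives 2^(k-1) - 1, which is one less than the figure
quoted. An ordered factor with k levels only needs its k - 1 cut points.

```r
# Distinct two-way partitions of k unordered levels: 2^(k - 1) - 1.
# For k ordered levels only the k - 1 cut points need checking.
n_splits_unordered <- function(k) 2^(k - 1) - 1
n_splits_ordered   <- function(k) k - 1

k <- c(5, 10, 20, 30)
data.frame(
  levels    = k,
  unordered = format(n_splits_unordered(k), big.mark = ","),
  ordered   = n_splits_ordered(k)
)
```
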
>>> 
>>> Best,
>>> 
>>> Torsten
>>> 
>>> 
>>>> DV seems to need identifying as a factor to force class trees over
>>>> regression:
>>>> 
>>>>> Mydata$V46<-as.factor(Mydata$V46)
>>>>> Myforest.rf<-randomForest(V46~.,data=Mydata,ntrees=100,mtry=7,
>>>>     proximities=FALSE, importance=FALSE)
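
One side note on the call as posted: in the randomForest package the
documented argument names are singular, `ntree` and `proximity`. Misspelled
names are most likely absorbed by `...` and ignored, so the defaults would
apply; that is an assumption about how the call is dispatched, not something
confirmed in this thread. A sketch with the documented names, on a synthetic
stand-in for Mydata (V1 and V2 are placeholder predictors), skipped if the
package is not installed:

```r
# Sketch only: synthetic stand-in for Mydata; V1 and V2 are placeholder columns.
set.seed(1)
Mydata <- data.frame(V1 = rnorm(200), V2 = rnorm(200))
Mydata$V46 <- as.factor(as.integer(Mydata$V1 + rnorm(200) > 0))

Myforest.rf <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  Myforest.rf <- randomForest::randomForest(
    V46 ~ ., data = Mydata,
    ntree      = 100,     # singular: number of trees to grow
    mtry       = 1,       # must not exceed the number of predictors (2 here)
    proximity  = FALSE,   # singular
    importance = FALSE
  )
  print(Myforest.rf$ntree)  # number of trees actually grown
}
```
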
>>>> 
>>>> 5 hours later, R.bin was still taking up 75% of my processor.  When
>>>> I've tried this with larger data, I get errors referring to the buffer
>>>> (sorry, not in front of me right now).
>>>> 
>>>> Any ideas on this? The data don't seem horrifically large. Seems like
>>>> there are a few options for setting memory size, but I'm not sure
>>>> which of them to try tweaking, or if that's even the issue.
>>>> 
>>>> ______________________________________________
>>>> R-help at stat.math.ethz.ch mailing list
>>>> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide!
>>>> http://www.R-project.org/posting-guide.html
>>>> 
>> 
>> -- 
>> David L. Van Brunt, Ph.D.
>> Outlier Consulting & Development
>> mailto: <ocd at well-wired.com>
>> 

-- 
David L. Van Brunt, Ph.D.
Outlier Consulting & Development
mailto: <ocd at well-wired.com>