[R] predict.randomForest

Fri Dec 10 22:16:37 CET 2004

Can you show how you called randomForest() and predict.randomForest()?  The
error message came from the code:

    if (is.data.frame(x)) {
        [changing ordered factors to numeric...]
        cat.new <- sapply(x, function(x) if (is.factor(x) && 
            !is.ordered(x)) 
            length(levels(x))
        else 1)
        [...]
        if (!all(object$forest$ncat == cat.new)) 
            stop("Type of predictors in new data do not match that of the training data.")
    }

Basically, it checks whether the number of categories for each predictor
variable in the new data matches the number used in training (1 = numeric).
If you used the formula interface, a mismatch is unlikely to happen.  But if
you are dealing with data with a huge number of variables, you should avoid
the formula interface.
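
To see which columns would trip that check, the same per-column count can be
computed for both data frames and compared.  This is only a sketch on made-up
data: count_categories(), train, newdata, and the column names are
hypothetical, and whether a dropped factor level is what happened in your
case I can't tell without seeing the calls.

```r
# Count categories per column the way the quoted check does:
# unordered factors -> number of levels, everything else -> 1.
count_categories <- function(df) {
  vapply(df, function(col) {
    if (is.factor(col) && !is.ordered(col)) nlevels(col) else 1L
  }, integer(1))
}

# Hypothetical toy data: 'soil' has one level fewer in the new data.
train <- data.frame(elev = c(10, 20, 30),
                    soil = factor(c("clay", "sand", "loam")))
newdata <- data.frame(elev = c(15, 25),
                      soil = factor(c("clay", "sand")))

ncat_train <- count_categories(train)
ncat_new <- count_categories(newdata)

# Names of the columns whose category counts disagree.
mismatched <- names(ncat_train)[ncat_train != ncat_new]
print(mismatched)

# One possible repair: re-declare the factor with the training levels.
newdata$soil <- factor(newdata$soil, levels = levels(train$soil))
```

With a fitted forest, the training-side counts would come from
object$forest$ncat, as in the code quoted above, rather than from the
training data frame.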

For predicting large data sets, I use a loop to read the data in one chunk
at a time, run the prediction on the chunk, and iterate until finished.  It
works just fine for me.
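
A minimal sketch of such a loop, under some assumptions: the data to predict
sit in a CSV file with a header row, and an lm() fit stands in for the forest
so the example runs without the randomForest package.  The file names, column
names, and chunk size are made up for illustration.

```r
# Stand-in model; with randomForest, 'fit' would be the object returned
# by randomForest() and predict() could be called with type = "prob".
set.seed(1)
train <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
train$y <- 2 * train$x1 - train$x2 + rnorm(50, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = train)

# Hypothetical input file to score, and an output file for the predictions.
infile <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".csv")
big <- data.frame(x1 = rnorm(25), x2 = rnorm(25))
write.csv(big, infile, row.names = FALSE)

chunk_size <- 10
con <- file(infile, open = "r")
# Read the header line once; later reads continue from the current position.
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

first <- TRUE
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size, col.names = col_names),
    error = function(e) NULL)  # read.csv signals an error once no lines remain
  if (is.null(chunk) || nrow(chunk) == 0) break
  pred <- data.frame(pred = predict(fit, newdata = chunk))
  # Write the header with the first chunk only, then append.
  write.table(pred, outfile, sep = ",", row.names = FALSE,
              col.names = first, append = !first)
  first <- FALSE
}
close(con)
```

If any predictors are factors, each chunk would presumably need its factor
columns re-declared with the training levels before calling predict(), since
a chunk read on its own may not contain every level.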

Cheers,
Andy

> From: Tim Howard
> 
> I have a data.frame with a series of variables tagged to a binary
> response ('present'/'absent').  I am trying to use randomForest to
> predict present/absent in a second dataset.  After a lot of fiddling
> (using two data frames, making sure data types are the same, lots of
> testing with data that works such as data(iris)) I've settled on
> combining all my data into one data.frame and then subset()'ing the
> known present/absent portion of the data.frame for the randomForest
> run and then using the other subset for the predict.  This worked with
> test data, but then when I try it on a larger dataset (63,000 rows to
> predict), I get this error:
> 
> Error in predict.randomForest(stsw.rf, stsw.out, type = "prob") : 
>         Type of predictors in new data do not match that of the
> training data.
> 
> This is the error I was getting earlier, but I thought I had solved it
> by joining into one data.frame and subsetting.  The values for each
> variable in the 'unknown' data (that which I want to predict) fall
> within (are bound by) the values in the 'known' data.  
> 
> Does this error message have more than one meaning?
> 
> Any suggestions on how to work through this?
> 
> I am using R 2.0.1, randomForest 4.4-2 (2004-11-02).  I'm a new user
> to R, but doing my best to learn as much as I can... if I'm obviously
> clueless, please forgive me!
> 
> Any help would be greatly appreciated,
> 
> Thanks in advance!
> Tim Howard
> 
> 
> More background for anyone interested:
>   CART (as well as many other statistical techniques) has been used for
> a while to predict plant and animal distributions across a landscape.
> You feed it data about places where you know the plant to occur and not
> occur, and CART provides you with a tree with which you can then model
> the potential distribution across your region (state, country, etc.)
> using GIS.
>    I've heard good things about randomForest and would like to try
> to do the same thing.  My biggest stumbling block is that I can't
> (obviously, once I realized it) get a single 'best tree' from
> randomForest with which to apply my GIS models.  Or, is there any way
> to extract a formula from randomForest similar to a CART or rpart tree
> and apply it to a dataset outside of R?  The only solution I've been
> able to come up with is to bring ALL of the environmental variables
> into R, have randomForest do the prediction, and then get that
> prediction back into GIS.  Thus my problem as I stated it above.  I'm
> worried because my datasets are going to be huge (hundreds of millions
> of records) when we really get going.  Should I be worried?
> 
> thanks,  Tim
> 