[R] predict.randomForest

Tim Howard tghoward at gw.dec.state.ny.us
Fri Dec 10 21:31:07 CET 2004


I have a data.frame with a series of variables tagged to a binary
response ('present'/'absent').  I am trying to use randomForest to
predict present/absent in a second dataset.  After a lot of fiddling
(using two data frames, making sure data types are the same, lots of
testing with data that works such as data(iris)) I've settled on
combining all my data into one data.frame and then subset()'ing the
known present/absent portion of the data.frame for the randomForest run
and then using the other subset for the predict.   This worked with test
data, but then when I try it on a larger dataset (63,000 rows to
predict), I get this error:

Error in predict.randomForest(stsw.rf, stsw.out, type = "prob") : 
        Type of predictors in new data do not match that of the
training data.

This is the error I was getting earlier, but I thought I had solved it
by joining into one data.frame and subsetting.  The values for each
variable in the 'unknown' data (that which I want to predict) fall
within (are bounded by) the values in the 'known' data.

Does this error message have more than one meaning?

Any suggestions on how to work through this?
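
One way to narrow this down is to compare, column by column, the class
and the factor levels of the 'known' and 'unknown' subsets before calling
predict. The following is only a minimal sketch in base R: the helper
compare_predictors() is hypothetical (it is not part of randomForest),
the toy data frames are made up, and it assumes the mismatch lies in
column classes or factor levels, which may not be the whole story.

```r
# Hypothetical helper (not part of randomForest): report, for every
# column the two data frames share, whether class and factor levels agree.
compare_predictors <- function(train, newdata) {
  common <- intersect(names(train), names(newdata))
  do.call(rbind, lapply(common, function(v) {
    data.frame(
      variable    = v,
      class_train = class(train[[v]])[1],
      class_new   = class(newdata[[v]])[1],
      same_class  = identical(class(train[[v]]), class(newdata[[v]])),
      # levels() is NULL for non-factors, so two numeric columns compare equal
      same_levels = identical(levels(train[[v]]), levels(newdata[[v]])),
      stringsAsFactors = FALSE
    )
  }))
}

# Made-up toy data: 'elev' differs in class, 'soil' differs in levels.
known   <- data.frame(elev = c(1.5, 2.5), soil = factor(c("a", "b")))
unknown <- data.frame(elev = c(1L, 2L),   soil = factor(c("a", "a")))

check <- compare_predictors(known, unknown)
# Show only the columns that disagree in class or in levels.
print(check[!(check$same_class & check$same_levels), ])
```

Any row that is printed points at a column worth inspecting with str()
in both subsets.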

I am using R 2.0.1 and randomForest 4.4-2 (2004-11-02). I'm a new user of
R, but doing my best to learn as much as I can... if I'm obviously
clueless, please forgive me!

Any help would be greatly appreciated,

Thanks in advance!
Tim Howard


More background for anyone interested:
  CART (as well as many other statistical techniques) has been used for
a while to predict plant and animal distributions across a landscape.
You feed it data about places where you know the Plant to occur and not
occur and CART provides you with a tree with which you can then model
the potential distribution across your region (state, country, etc)
using GIS.
   I've heard good things about randomForest and would like to try
to do the same thing. My biggest stumbling block is that I can't
(obviously, once I realized it) get a single 'best tree' from
randomForest to apply in my GIS models.  Or is there any way
to extract a formula from randomForest, similar to a CART or rpart tree,
and apply it to a dataset outside of R?  The only solution I've been
able to come up with is to bring ALL of the environmental variables into R,
have randomForest do the prediction, and then get that prediction back
into GIS. Thus my problem as I stated it above. I'm worried because my
datasets are going to be huge (100's of millions of records) when we
really get going. Should I be worried?
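
If memory turns out to be the limit, one option is to predict in
blocks of rows so that only one block's predictions are computed at a
time. The following is a rough sketch, not a tested recipe for
randomForest: predict_in_chunks() and its chunk_size argument are
hypothetical names, and lm() on the built-in cars data stands in for the
fitted forest only so that the example runs without extra packages. It
assumes each row is predicted independently of the others, so that
splitting the rows does not change the result.

```r
# Hypothetical helper: call predict() on successive blocks of rows and
# stitch the pieces back together in the original row order.
predict_in_chunks <- function(model, newdata, chunk_size = 10000, ...) {
  n <- nrow(newdata)
  stopifnot(n > 0, chunk_size > 0)
  starts <- seq(1, n, by = chunk_size)
  pieces <- lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1, n)
    predict(model, newdata[rows, , drop = FALSE], ...)
  })
  # Matrix output (e.g. class probabilities) is stacked by row;
  # vector or factor output is concatenated.
  if (is.matrix(pieces[[1]])) do.call(rbind, pieces) else unlist(pieces)
}

# Stand-in model so the sketch is self-contained (assumption: any model
# with a predict() method that works row by row behaves the same way).
fit     <- lm(dist ~ speed, data = cars)
full    <- predict(fit, cars)
chunked <- predict_in_chunks(fit, cars, chunk_size = 7)
print(isTRUE(all.equal(full, chunked)))
```

For data too large to hold in R at all, the same loop could read each
block from a file or database instead of subsetting a data frame, and
write each block's predictions out before reading the next.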

thanks,  Tim



