[R] Random Forest for Ecological Prediction in the Presence of Spatial Autocorrelation

Mon May 24 13:45:58 CEST 2010

Dear R-help list members,

I have a statistical question regarding the Random Forest function (RF) as
applied to ecological prediction of species presences and absences.

RF seems to perform very well for prediction of species ranges or
prevalences. However, the problem with my dataset is a high degree of
spatial autocorrelation and therefore a low effective sample size compared
to the full number of gridpoints (0.5 degree grid extending over all land
areas north of 55 deg. south, ~60000 grid points). My variables are highly
correlated in both the x and y directions. When using the entire dataset
in the RF function, the misclassification rate is implausibly low,
suggesting overfitting. The noisy marginal probability plots (see attached
example) seem to support this idea. My question is: Is there a way to make
the decision trees in RF more generalizable without modelling the spatial
autocorrelation explicitly? I have thought of four ways of doing this:
1. Spatially clustering observations into training and test datasets and
averaging the predicted class probability values to approximate "real"
certainty. This could be done at the country level or in a chessboard-like
pattern.
2. Requiring a higher minimum nodesize to prevent the creation of
overfitted, maximal trees. Which value of "nodesize" might be appropriate?
3. Reducing the number of variables in the model by taking just one
out of each group of correlated variables (say, for example, only winter
temperature instead of temperatures from all seasons). This variable
selection would be based on the Variable Importance plots. I was considering
using the Gini measure ranking instead of the accuracy ranking to produce
simpler, more "biological" trees; please comment on this.
4. Requiring RF to choose only a certain number of "TRUE" and "FALSE"
("presence"-"absence") observations using the "sampsize" option, thereby
increasing the distance between the gridpoints chosen to build the model so
as to reduce correlation between observations.
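
To make the chessboard idea in option 1 concrete, here is a minimal sketch of
how grid points could be assigned to two spatial folds. It is written in plain
Python (standard library only) rather than R, and the function names, the
5-degree block size and the small example grid are all my own assumptions for
illustration, not values taken from my dataset.

```python
import math

def chessboard_fold(lon, lat, block_deg=5.0):
    """Return 0 or 1: the chessboard 'colour' of the block containing the point.

    block_deg is a hypothetical block size in degrees; in practice it would
    presumably need to be larger than the range of spatial autocorrelation.
    """
    col = math.floor(lon / block_deg)   # block index in x direction
    row = math.floor(lat / block_deg)   # block index in y direction
    return (col + row) % 2              # alternating 0/1 pattern

def split_by_chessboard(points, block_deg=5.0):
    """Split (lon, lat) points into two lists, one per chessboard colour."""
    train, test = [], []
    for lon, lat in points:
        if chessboard_fold(lon, lat, block_deg) == 0:
            train.append((lon, lat))
        else:
            test.append((lon, lat))
    return train, test

# Illustrative 0.5-degree grid of cell centres (40 x 40 = 1600 points),
# covering 4 x 4 blocks of 5 degrees each.
points = [(-9.75 + 0.5 * i, 40.25 + 0.5 * j)
          for i in range(40) for j in range(40)]

train, test = split_by_chessboard(points)
print(len(train), len(test))
```

The model would then be fitted on the points of one colour and evaluated on
the other, and the roles swapped, so that training and test points are
separated by block boundaries rather than interleaved on the grid.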

Which of these approaches would you suggest pursuing? Certainly some of you
have faced and tackled the problem of spatial autocorrelation in ecological
prediction. I am aware of the works of Araujo et al. (2005) and Koenig
(1999); any further suggested reading (especially examples of how spatial
autocorrelation can be dealt with practically) would be highly welcome.

Kind regards,

Andreas Beguin
##########################################
Division of Epidemiology and Global Health
Department of Public Health and Clinical Medicine
Umea University
907 31 Umea Sweden