[R] randomForest: predictor importance (for regressions)

Thu May 6 17:06:14 CEST 2010

From: Dimitri Liakhovitski 
> >> Andy, I'll explain why I am asking. I probably should have done it
> >> in the beginning:
> >> I am asking not in order to figure out how to do it. I am asking in
> >> order to figure out something that was done around November 01, 2008.
> >> Back then, a piece of code was run where from the object of 
> >> randomForest(.... importance=T...) the importances
> >> ($importance) were extracted (just by referring to
> >> $importance) and the first column was used.
> >> Do you happen to know what they were back then? Standardized or not?
> >
> > The change coincided with the introduction of the importanceSD
> > component, due to the change in how the importance is measured.  The
> > "importance" component is just mean(d[i]), and importanceSD is
> > sd(d[i])/sqrt(ntree).  The importance() function by default
> > (scale=TRUE) does the normalization, and that's what you should use.
> > Leo found that this normalization greatly reduces the "bias" due to
> > the different number of possible splits in different predictors.
> 
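
As a rough sketch of the arithmetic described above (not the package's internal code): the vector d below is simulated and purely hypothetical, standing in for the per-tree differences d[i] of a single predictor. It only illustrates that the scaled measure is the mean divided by its standard error, which is the same form as a one-sample t statistic.

```r
# Hypothetical per-tree differences d[i] for one predictor (simulated here;
# these are NOT values produced by randomForest).
set.seed(1)
ntree <- 500
d <- rnorm(ntree, mean = 1.4, sd = 1.8)

raw_importance    <- mean(d)              # analogue of the $importance entry
importance_se     <- sd(d) / sqrt(ntree)  # analogue of the $importanceSD entry
scaled_importance <- raw_importance / importance_se  # analogue of scale=TRUE

print(c(raw = raw_importance, se = importance_se, scaled = scaled_importance))
```

Because the divisor shrinks with sqrt(ntree), the scaled values depend on the number of trees, which is worth keeping in mind when comparing runs.
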
> Actually, it looks like if one extracts incorrectly (by
> looking just at $importance), then one gets unscaled
> results. Hope it was the same in 2008.

Yes.  The NEWS file (what you see when you type rfNews()) shows the following for version 4.3-0:

* The `importance' component of randomForest object has been changed:
  The permutation-based measures are not divided by their `standard
  errors'.  Instead, the `standard errors' are stored in the
  `importanceSD' component.  One should use the importance() extractor
  function rather than something like rf.obj$importance for extracting
  the importance measures.

and version 4.3-0 is dated 2004-07-07.
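
A minimal sketch of the difference between the raw component and the extractor, assuming the randomForest package is installed. The mtcars data set and the object name rf are stand-ins chosen for illustration; they are not from the original thread.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(42)
# mtcars is only a stand-in regression problem for illustration.
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 500)

# Direct component access gives the unscaled permutation measure.
raw_slot <- rf$importance[, "%IncMSE"]

# The extractor with scale=FALSE should return the same unscaled values.
raw_extr <- importance(rf, type = 1, scale = FALSE)[, 1]

# The default (scale=TRUE) divides by the stored 'standard errors'.
scaled <- importance(rf, type = 1, scale = TRUE)[, 1]
manual <- rf$importance[, "%IncMSE"] / rf$importanceSD

print(cbind(raw_slot, raw_extr, scaled, manual))
```

If the two pairs of columns agree, then code that read the first column of $importance was working with unscaled values under any package version from 4.3-0 onward.
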

Andy

> I've just run an example randomForest for a case with 6
> predictors (importance = T). My randomForest object is "rftest".
> Below are some results:
> 
> Looking at importances the way it was done in November 2008:
> as.data.frame(rftest$importance)[1]
> I am getting:
> 
>  %IncMSE
> v1 1.3900833
> v2 1.2219338
> v3 0.6337521
> v4 1.4101760
> v5 1.4474130
> v6 0.7583074
> 
> Extracting as you recommended one should - looking for unscaled
> results:  importance(rftest, scale=F)
> I am getting exactly the same results as above:
> 
>      %IncMSE IncNodePurity
> v1 1.3900833     147.31267
> v2 1.2219338     147.51669
> v3 0.6337521      97.11210
> v4 1.4101760     149.48934
> v5 1.4474130     149.61458
> v6 0.7583074      97.74933
> 
> Now, I am extracting scaled importances:  importance(rftest, scale=T)
> I am getting:
> 
>     %IncMSE IncNodePurity
> v1 16.97155     147.31267
> v2 17.04288     147.51669
> v3 10.19135      97.11210
> v4 18.22732     149.48934
> v5 18.36879     149.61458
> v6 10.46555      97.74933
> 
> This is the same as what I get when I do this the way it was done in
> 2008:  
> as.data.frame(rftest$importance)[1]/as.data.frame(rftest$importanceSD)
> Resulting in:
> 
>     %IncMSE
> v1 16.97155
> v2 17.04288
> v3 10.19135
> v4 18.22732
> v5 18.36879
> v6 10.46555
> 
> Dimitri
> 