[R] randomForest: predictor importance (for regressions)

Thu May 6 15:40:40 CEST 2010

Not that I want to pick on you, but can you turn off the HTML format in
your messages?  The mailing list balks at such format, and I can't reply
in plain text with the right formatting of previous messages (I had to
manually remove the tabbed indents that Outlook added when I changed to
plain text).

See reply inline below.

Andy 

From: Dimitri Liakhovitski [mailto:ld7631 at gmail.com] 

Andy, thank you - and sorry for being a bit slow (see my questions
below):

On Thu, May 6, 2010 at 8:37 AM, Liaw, Andy <andy_liaw at merck.com> wrote:

See reply inline below.

Andy

From: Dimitri Liakhovitski

>
> I have a question about predictor importances in randomForest.
>
> Once I've run randomForest and got my object, I get their importances:
> rfresult$importance
> I also get the "standard errors" of the permutation-based importance
> measure: rfresult$importanceSD
>
> I have 2 questions:
>
> 1. Because I am dealing with regressions, I am getting an importance
> object (rfresult$importance) with two columns, labeled "%IncMSE" (the
> first column) and "IncNodePurity" (the second column). I assume it's the
> first one that is the mean decrease in accuracy due to permutation. Am I
> correct or am I wrong? I am confused because ?randomForest says: "For
> Regression, the first column is the mean decrease in accuracy and the
> second the mean decrease in MSE." - but it is the first column, not the
> second, that has "MSE" in its header.

In regression trees, node impurity is measured by MSE.  Therefore the
second measure, which averages over all trees the cumulative reduction
in node impurity due to splits on a variable, is labelled "mean decrease
in MSE".

Andy, but it is the FIRST column in $importance (not the SECOND) that is
labeled "%IncMSE". The second column is labeled "IncNodePurity". So, I
am confused - which one is the mean decrease in accuracy?
Or, maybe I should ask again: in the case of regression trees, which of
the two columns in $importance contains the predictor importances
calculated by randomly permuting values and looking at how much worse
the prediction has become?
I assume it's the first column (labeled "%IncMSE"). Is this correct?

[AL]: Note I said "reduction in node impurity", which is another way of
saying "increase in node purity" 8-).  I should think from the help page
for importance() it should be clear which is which.  When you permute
the values of a variable in the OOB data and make predictions, the
expectation is that the MSE will increase, especially if the variable
has some importance, thus the label "%IncMSE".  Why do you need to assume?
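
A minimal sketch of how the two measures can be told apart without
guessing at column order. It assumes the randomForest package is
installed; the mtcars data, the seed, ntree = 500, and the object names
are illustrative choices and do not come from this thread.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only so the fit is reproducible

# importance = TRUE is needed for the permutation-based measure
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 500)

# type = 1: permutation-based measure (increase in OOB MSE when the
# variable is permuted)
perm_imp <- importance(rf, type = 1)

# type = 2: decrease in node impurity from splits on the variable
purity_imp <- importance(rf, type = 2)

# The column labels say which measure each matrix holds
colnames(perm_imp)
colnames(purity_imp)
```

Asking for a measure by type means the answer does not depend on which
column happens to come first in the stored component.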

> 2. According to this thread
> (http://www.mail-archive.com/r-help@stat.math.ethz.ch/msg94873.html),
> the overall importance measure is mean(d[i]) / se(d[i]), where se(d[i])
> is sd(d[i])/sqrt(ntree) (the "standard error").
> So, in order to get at the importance of predictors (and I want to use
> the permutation-based importance) - should I just take the first column
> of rfresult$importance or should I first divide rfresult$importance by
> rfresult$importanceSD - to get something analogous to z-scores and use
> those?

See the "scale" argument in ?importance.  The recommended way of
extracting components of an object in R is to use the extractor
functions when they exist.
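
A short sketch of the "scale" argument in use, assuming the randomForest
package is installed. The mtcars data, the seed, and the object names are
illustrative only. Per ?importance, scale controls whether the
permutation-based measure is divided by its "standard error"; the sketch
simply puts both versions side by side.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only so the fit is reproducible
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

# Permutation-based importance with and without scaling
scaled   <- importance(rf, type = 1, scale = TRUE)   # scale = TRUE is the default
unscaled <- importance(rf, type = 1, scale = FALSE)

# One row per predictor, one column per setting
cbind(scaled = scaled[, 1], unscaled = unscaled[, 1])
```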

Andy, I've run randomForest (for regression) and just wrote: importance
= TRUE. Now, I am just looking at $importance (without specifying
anything at all, not scale either). So, if I do it that way - then to
get the standardized permutation-based importances, should I divide the
first column of $importance by $importanceSD - or has it been done by
default so that the first column of $importance already contains the
standardized importances?

[AL]: As I said, I recommend that you use importance() to extract
variable importance.  The recommendation is for avoiding confusion like
yours.  If you want to know what the components in the object give you,
compared to what the extractor function returns, you can look inside the
extractor function to find out for yourself.  Really, I'm not trying to
be difficult, but there are very good reasons for not accessing the
components directly when extractor functions exist.  If the underlying
components are somehow changed in the future, only the extractor
functions are guaranteed to give you the "right thing".  I added the
extractor function for importance measures precisely because the way
they are computed changed.
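
One way to "look inside the extractor function" and check what the stored
component holds, sketched under the assumption that the randomForest
package is installed. The mtcars data, the seed, and the object names are
illustrative. The comparison is deliberately left to report its own
result, since what $importance stores may differ between package
versions.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only so the fit is reproducible
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

# Show the source of the method that importance() dispatches to
imp_method <- getS3method("importance", "randomForest")
print(imp_method)

# Compare the stored component with what the extractor returns
stored <- rf$importance[, "%IncMSE"]
matches_unscaled <- isTRUE(all.equal(
  stored, importance(rf, type = 1, scale = FALSE)[, 1]))
matches_scaled <- isTRUE(all.equal(
  stored, importance(rf, type = 1, scale = TRUE)[, 1]))

# Reports which extractor setting, if any, the stored values match
c(unscaled = matches_unscaled, scaled = matches_scaled)
```

Whatever the check reports for a given installed version, code that calls
importance() keeps working if the stored representation changes, which is
the reason given above for preferring the extractor.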

Thank you!
Dimitri

> Thank you very much!
>
> --
> Dimitri Liakhovitski
> Ninah.com
> Dimitri.Liakhovitski at ninah.com
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Notice:  This e-mail message, together with any attachments, contains
information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
New Jersey, USA 08889), and/or its affiliates Direct contact information
for affiliates is available at
http://www.merck.com/contact/contacts.html) that may be confidential,
proprietary copyrighted and/or legally privileged. It is intended solely
for the use of the individual or entity named on this message. If you are
not the intended recipient, and have received this message in error,
please notify us immediately by reply e-mail and then delete it from
your system.

