[R] Recursive partitioning with multicollinear variables

Frank E Harrell Jr feh3k at spamcop.net
Mon Feb 9 12:53:47 CET 2004

On Mon, 9 Feb 2004 11:24:39 +0100
"Jean-Noel" <jean-noel.candau at avignon.inra.fr> wrote:

> Dear all,
> I would like to perform a regression tree analysis on a dataset with
> multicollinear variables (as climate variables often are). The questions
> that I am asking are:
>  1- Is there any particular statistical problem in using multicollinear
> variables in a regression tree?
>  2- Multicollinear variables should appear as alternate splits. Would it
> be more accurate to present these alternate splits in the results of the
> analysis or apply a variable selection or reduction procedure before the
> regression tree?
> Thank you in advance,
> Jean-Noel Candau

A more accurate and stable result would be obtained by first performing a
data reduction procedure that ignores the response variable.  Combining
collinear variables into an index is often better than arbitrarily
choosing between them.  Then use the indexes in a regression model rather
than a tree, unless you have tens of thousands of observations for
recursive partitioning or are using bagging of trees or a related
procedure to cancel out the instability of the tree-growing process
(which unfortunately will often result in an average of trees that looks
more complex than a regression model).
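
As a minimal sketch of the data reduction idea, the R code below combines
three collinear predictors into a single index without looking at the
response, then uses that index in an ordinary regression.  The data are
simulated and all variable names (tmin, tmean, tmax, rain, temp_index) are
hypothetical.  Taking the first principal component as the index is one
possible choice assumed here for illustration; the post does not
prescribe a specific reduction method.

```r
# Simulated data; every name here is hypothetical and for illustration only.
set.seed(1)
n      <- 200
latent <- rnorm(n)                       # shared signal behind the collinear block
tmin   <- latent + rnorm(n, sd = 0.3)    # three strongly collinear predictors
tmean  <- latent + rnorm(n, sd = 0.3)
tmax   <- latent + rnorm(n, sd = 0.3)
rain   <- rnorm(n)                       # a predictor outside the collinear block
y      <- 2 * latent - rain + rnorm(n)   # response; NOT used in the reduction step

# Data reduction that ignores y: the first principal component of the
# standardized collinear block serves as a single index.
pc         <- prcomp(cbind(tmin, tmean, tmax), center = TRUE, scale. = TRUE)
temp_index <- pc$x[, 1]

# Fraction of the block's total variance captured by the index
var_explained <- pc$sdev[1]^2 / sum(pc$sdev^2)

# Use the index (instead of the three raw variables) in a regression model
fit <- lm(y ~ temp_index + rain)
print(summary(fit))
```

The sign of a principal component is arbitrary, so the sign of the
temp_index coefficient should be interpreted together with the loadings
in pc$rotation.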

Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
