[R] Recursive partitioning with multicollinear variables

Bill.Venables at csiro.au
Mon Feb 9 12:32:50 CET 2004

No, for regression trees collinearity is a non-issue, because tree fitting is not a linear procedure.  Having variables that are linearly dependent (even exactly so) merely widens the set of cuts the algorithm can choose from.

I'm not sure what you mean by "Multicollinear variables should appear as alternate splits".  Do you mean that every second split should be in one variable of a particular set?  Perhaps you mean "alternative" instead of "alternate"?  In either case I think you are worrying over nothing.  Just go ahead and do the tree-based model analysis and don't worry about it.

Here is a little picture that might clarify things.  Suppose Latitude and Longitude are two variables on which the algorithm may choose to split.  This means that splits in these geographical variables can only occur in a North-South or an East-West direction.  Let's suppose you add in two extra variables that are completely dependent on the first two, namely

	LatPlusLong <- Latitude + Longitude
	LatMinusLong <- Latitude - Longitude

and now offer all four variables as potential split variables.  Now the algorithm may split North-South, East-West, NorthEast-SouthWest or NorthWest-SouthEast.  All you have done is increase the scope of choice for the algorithm to make splits.  Not only does the linear dependence not matter, but I'd argue it could be a pretty good thing.
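A minimal sketch of this picture in R, assuming the rpart package and simulated data. The response y, the sample size and the diagonal boundary are invented purely for illustration; only the two derived variables come from the text above.

```r
library(rpart)

set.seed(1)
n <- 500
Latitude  <- runif(n, -1, 1)
Longitude <- runif(n, -1, 1)

# Hypothetical response: it jumps across a diagonal line,
# which no single North-South or East-West cut can follow.
y <- ifelse(Latitude + Longitude > 0, 1, 0) + rnorm(n, sd = 0.1)

dat <- data.frame(y, Latitude, Longitude)
dat$LatPlusLong  <- dat$Latitude + dat$Longitude
dat$LatMinusLong <- dat$Latitude - dat$Longitude

# Tree offered only the two original variables
fit2 <- rpart(y ~ Latitude + Longitude, data = dat)

# Tree offered all four, including the exactly dependent ones
fit4 <- rpart(y ~ Latitude + Longitude + LatPlusLong + LatMinusLong,
              data = dat)

# Count terminal nodes in each tree
leaves2 <- sum(fit2$frame$var == "<leaf>")
leaves4 <- sum(fit4$frame$var == "<leaf>")

# Variable used for the first split of the four-variable tree
root4 <- as.character(fit4$frame$var[1])
```

With data built this way, the four-variable tree should be able to pick up the diagonal boundary in a single cut on LatPlusLong, whereas the two-variable tree has to approximate it with a staircase of axis-parallel cuts. The exact dependence among the predictors causes no fitting problem.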

One serious message to take from this as well, though, is to use regression trees for prediction.  Don't read too much into the variables that the algorithm has chosen to use at any stage.
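To see why the chosen variables deserve little weight, here is a rough sketch, again assuming rpart and made-up data (x1, x2, y and the 50 bootstrap resamples are all hypothetical). Two nearly collinear predictors are offered, and the variable picked for the first split is recorded over bootstrap resamples.

```r
library(rpart)

set.seed(42)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly collinear with x1
y  <- ifelse(x1 > 0, 2, 0) + rnorm(n, sd = 0.5)
dat <- data.frame(y, x1, x2)

n_boot   <- 50
root_var <- character(n_boot)
for (b in seq_len(n_boot)) {
  idx <- sample(n, replace = TRUE)            # bootstrap resample
  fit <- rpart(y ~ x1 + x2, data = dat[idx, ])
  root_var[b] <- as.character(fit$frame$var[1])  # first split variable
}

# How often each variable was chosen for the first split
table(root_var)
```

The first split may land on x1 in some resamples and on x2 in others, even though the fitted values are much the same either way. That is the sense in which the tree is fine for prediction but the identity of the split variable should not be over-interpreted.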

Bill Venables.

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Jean-Noel
Sent: Monday, 9 February 2004 8:25 PM
To: r-help at stat.math.ethz.ch
Subject: [R] Recursive partitioning with multicollinear variables

Dear all,
I would like to perform a regression tree analysis on a dataset with multicollinear variables (as climate variables often are). The questions that I am asking are:
 1- Is there any particular statistical problem in using multicollinear variables in a regression tree?
 2- Multicollinear variables should appear as alternate splits. Would it be more accurate to present these alternate splits in the results of the analysis, or to apply a variable selection or reduction procedure before the regression tree?

Thank you in advance,

Jean-Noel Candau

INRA - Unité de Recherches Forestières Méditerranéennes
Avenue A. Vivaldi
Tel: (33) 4 90 13 59 22
Fax: (33) 4 90 13 59 59

