# [R] pseudo-R2 or GOF for regression trees?

Frank E Harrell Jr f.harrell at vanderbilt.edu
Sat May 5 22:52:25 CEST 2007

```
Prof. Jeffrey Cardille wrote:
> Hello,
>
> Is there an accepted way to convey, for regression trees, something
> akin to R-squared?
>
> I'm developing regression trees for a continuous y variable and I'd
> like to say how well they are doing. In particular, I'm analyzing the
> results of a simulation model having highly non-linear behavior, and
> asking what characteristics of the inputs are related to a particular
> output measure.  I've got a very large number of points: n=4000.  I'm
> not able to do a model sensitivity analysis because of the large
> number of inputs and the model run time.
>
> I've been googling around both on the archives and on the rest of the
> web for several hours, but I'm still having trouble getting a firm
> sense of the state of the art.  Could someone help me to quickly
> understand what strategy, if any, is acceptable to say something like
> "The regression tree in Figure 3 captures 42% of the variance"?  The
> target audience is readers who will be interested in the subsequent
> verbal explanation of the relationship, but only once they are
> comfortable that the tree really does capture something.  I've run
> across methods to say how well a tree does relative to a set of trees
> on the same data, but that doesn't help much unless I'm sure the
> trees in question are really capturing the essence of the system.
>
> I'm happy to be pointed to a web site or to a thread I may have
> missed that answers this exact question.
>
> Thanks very much,
>
> Jeff
>
> ------------------------------------------
> Prof. Jeffrey Cardille
> jeffrey.cardille at umontreal.ca

Ye (1998; reference below) describes a method for getting a nearly
unbiased estimate of R^2 from recursive partitioning.  In his examples
the result was similar to applying the adjusted R^2 formula with
regression degrees of freedom equal to about 3n/4.  Alternatively,
something like 10-fold cross-validation repeated 20 times gives a fairly
precise and unbiased estimate of R^2.

Frank

@ARTICLE{ye98mea,
  author = {Ye, Jianming},
  year = 1998,
  title = {On measuring and correcting the effects of data mining and model selection},
  journal = JASA,
  volume = 93,
  pages = {120-131},
  annote = {generalized degrees of freedom; GDF; effective degrees of freedom; data mining; model selection; model uncertainty; overfitting; nonparametric regression; CART; simulation setup}
}
--
Frank E Harrell Jr   Professor and Chair           School of Medicine
Department of Biostatistics   Vanderbilt University

```
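The two suggestions in the reply can be sketched in a few lines. What follows is a minimal illustration, not Ye's generalized-degrees-of-freedom procedure itself: `adjusted_r2` applies the standard adjusted R^2 formula with a caller-supplied model degrees of freedom, and `repeated_cv_r2` averages out-of-fold R^2 over repeated random k-fold splits. The function names, the one-split `fit_stump` stand-in for a real regression tree, the simulated data, and the choice to pool out-of-fold predictions within each repeat are all assumptions made for this example.

```python
import random
from statistics import fmean


def adjusted_r2(r2, n, df_model):
    """Standard adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - df_model - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - df_model - 1)


def r2_score(y_true, y_pred):
    """1 - SSE/SST, with SST taken about the mean of y_true."""
    mean_y = fmean(y_true)
    sse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    sst = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - sse / sst


def repeated_cv_r2(x, y, fit, k=10, repeats=20, seed=0):
    """Mean out-of-fold R^2 over `repeats` random k-fold splits.

    `fit(x_train, y_train)` must return a function mapping one x value
    to a prediction. Predictions are pooled within a repeat (an assumed
    convention) before R^2 is computed.
    """
    rng = random.Random(seed)
    n = len(y)
    scores = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        pred = [0.0] * n
        for f in range(k):
            test = idx[f::k]  # every k-th shuffled index forms one fold
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            predict = fit([x[i] for i in train], [y[i] for i in train])
            for i in test:
                pred[i] = predict(x[i])
        scores.append(r2_score(y, pred))
    return fmean(scores)


def fit_stump(x_train, y_train):
    """Hypothetical stand-in for a tree: one least-squares split on x."""
    pairs = sorted(zip(x_train, y_train))
    overall = fmean(y_train)
    best = None
    for j in range(1, len(pairs)):
        if pairs[j - 1][0] == pairs[j][0]:
            continue  # cannot split between identical x values
        left = [p[1] for p in pairs[:j]]
        right = [p[1] for p in pairs[j:]]
        ml, mr = fmean(left), fmean(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[j - 1][0] + pairs[j][0]) / 2, ml, mr)
    if best is None:  # all x identical: predict the overall mean
        return lambda xi: overall
    _, cut, ml, mr = best
    return lambda xi: ml if xi <= cut else mr


# Simulated data (assumption): a noisy step function of a single input.
rng = random.Random(1)
x = [rng.uniform(0, 1) for _ in range(120)]
y = [(1.0 if xi > 0.5 else 0.0) + rng.gauss(0, 0.3) for xi in x]

cv_r2 = repeated_cv_r2(x, y, fit_stump, k=10, repeats=20)
apparent_r2 = r2_score(y, [fit_stump(x, y)(xi) for xi in x])
print(f"apparent R^2 = {apparent_r2:.3f}, cross-validated R^2 = {cv_r2:.3f}")
```

With n = 4000 and the roughly 3n/4 degrees of freedom mentioned in the reply, the correction factor (n - 1)/(n - df - 1) is about 4, so `adjusted_r2(0.9, 4000, 3000)` comes out near 0.60. In practice the `fit` argument would wrap whatever tree-growing routine is actually being used, including any pruning step, so that model selection is repeated inside every fold.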