[R] sample size > 20K? Was: fitness of regression tree: how to measure???

Thu Apr 1 23:23:13 CEST 2010

The discussion of Leo Breiman's paper in Statistical Science, "Statistical Modeling: The Two Cultures", is a must-read for all statisticians doing predictive modeling.  See especially the exchange between Cox and Breiman (I call this the Cox-Breiman duel).

Ravi.

____________________________________________________________________

Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvaradhan at jhmi.edu

----- Original Message -----
From: Bert Gunter <gunter.berton at gene.com>
Date: Thursday, April 1, 2010 12:55 pm
Subject: Re: [R] sample size > 20K? Was: fitness of regression tree: how to measure???
To: 'Frank E Harrell Jr' <f.harrell at vanderbilt.edu>, 'vibha patel' <vibhapatelddu at gmail.com>
Cc: r-help at r-project.org

> Since Frank has made this somewhat cryptic remark (sample size > 20K)
> several times now, perhaps I can add a few words of (what I hope is) further
> clarification.
> 
> Despite any claims to the contrary, **all** statistical (i.e. empirical)
> modeling procedures are just data interpolators: that is, all that they can
> claim to do is produce reasonable predictions of what may be expected within
> the extent of the data. The quality of the model is judged by the goodness
> of fit/prediction over this extent. Ergo the standard textbook caveats about
> the dangers of extrapolation when using fitted models for prediction. Note,
> btw, the contrast to "mechanistic" models, which typically **are** assessed
> by how well they **extrapolate** beyond current data. For example, Newton's
> apple to the planets. They are often "validated" by their ability to "work"
> in circumstances (or scales) much different than those from which they were
> derived.
> 
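The extrapolation caveat above can be made concrete with a small simulation. This is only a sketch under stated assumptions: the "true" relationship y = x^2, the noise level, the seed, and the sample range 0 <= x < 1 are all invented for illustration, and a least-squares straight line stands in for any empirical fit. It is written in plain Python so that it runs without extra packages.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility

# Hypothetical truth, unknown to the modeller: y = x^2.
# The sample covers only 0 <= x < 1.
x = [random.random() for _ in range(200)]
y = [xi ** 2 + random.gauss(0, 0.02) for xi in x]

# Ordinary least-squares straight line, used purely as an empirical fit.
mx, my = sum(x) / len(x), sum(y) / len(y)
slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
intercept = my - slope * mx

def predict(x0):
    """Fitted straight-line prediction at x0."""
    return intercept + slope * x0

def truth(x0):
    """The assumed mechanism that generated the data."""
    return x0 ** 2

# Worst error on a grid inside the sampled range, and the error far outside it.
inside = max(abs(predict(i / 10) - truth(i / 10)) for i in range(11))
outside = abs(predict(5.0) - truth(5.0))

print(f"largest error on a grid inside [0, 1]: {inside:.2f}")
print(f"error at x = 5:                        {outside:.2f}")
```

Within the sampled range the straight line is a serviceable interpolator; at x = 5 it is far off, because nothing in the data constrained the fit there.
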
> So statistical models are just fancy "prediction engines." In particular,
> there is no guarantee that they provide any meaningful assessment of
> variable importance: how predictors causally relate to the response.
> Obviously, empirical modeling can often be useful for this purpose,
> especially in well-designed studies and experiments, but there's no
> guarantee: it's an "accidental" byproduct of effective prediction.
> 
> This is particularly true for happenstance (un-designed) data and
> non-parametric models like regression/classification trees. Typically, there
> are many alternative models (trees) that give essentially the same quality
> of prediction. You can see this empirically by removing a modest random
> subset of the data and re-fitting. You should not be surprised to see the
> fitted model -- the tree topology -- change quite radically. HOWEVER, the
> predictions of the models within the extent of the data will be quite
> similar to the original results. Frank's point is that unless the data set
> is quite large and the predictive relationships quite strong -- which
> usually implies parsimony -- this is exactly what one should expect. Thus it
> is critical not to over-interpret the particular model one gets, i.e. to
> infer causality from the model (tree) structure.
> 
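The refit-on-a-subset check described above can be sketched as follows. Everything here is an assumption made for illustration: the simulated data (two nearly redundant predictors), the seed, and the hand-rolled greedy least-squares tree, which is a stand-in for rpart and not rpart's algorithm (no pruning, no surrogate splits). How often the root variable actually flips depends on the data and the seed.

```python
import random

def fit_tree(X, y, max_depth=3, min_leaf=10):
    """Greedy least-squares regression tree; a leaf is the mean response."""
    n = len(y)
    total = sum(y)
    if max_depth == 0 or n < 2 * min_leaf:
        return total / n
    best = None  # (score, feature index, threshold)
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):  # k = number of rows sent to the left child
            left_sum += y[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if k < min_leaf or n - k < min_leaf or lo == hi:
                continue
            # Maximising this is equivalent to minimising the split's SSE.
            score = left_sum ** 2 / k + (total - left_sum) ** 2 / (n - k)
            if best is None or score > best[0]:
                best = (score, j, (lo + hi) / 2)
    if best is None:
        return total / n
    _, j, t = best
    left = [i for i in range(n) if X[i][j] <= t]
    right = [i for i in range(n) if X[i][j] > t]
    return (j, t,
            fit_tree([X[i] for i in left], [y[i] for i in left],
                     max_depth - 1, min_leaf),
            fit_tree([X[i] for i in right], [y[i] for i in right],
                     max_depth - 1, min_leaf))

def predict(tree, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree

random.seed(42)  # arbitrary seed, only for reproducibility
n = 300
x1 = [random.gauss(0, 1) for _ in range(n)]
X = [[a, a + random.gauss(0, 0.1)] for a in x1]  # two nearly redundant predictors
y = [r[0] + r[1] + random.gauss(0, 0.3) for r in X]

full = fit_tree(X, y)
full_pred = [predict(full, r) for r in X]
mean_y = sum(y) / n
sd_y = (sum((v - mean_y) ** 2 for v in y) / (n - 1)) ** 0.5

roots, diffs = [], []
for _ in range(20):
    keep = random.sample(range(n), int(0.9 * n))  # drop a random 10% of rows
    tree = fit_tree([X[i] for i in keep], [y[i] for i in keep])
    roots.append(tree[0])  # index of the variable used at the root split
    pred = [predict(tree, r) for r in X]
    diffs.append(sum(abs(p - q) for p, q in zip(pred, full_pred)) / n)

print("root split variable in each refit:", roots)
print(f"mean |refit - full| prediction gap: {sum(diffs) / len(diffs):.3f}")
print(f"standard deviation of y:            {sd_y:.3f}")
```

Comparing the list of root variables with the prediction gap (read against the spread of y) separates the two things being discussed: the structure of the fitted tree and the predictions it produces within the data.
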
> Incidentally, there is nothing new or radical in this; indeed, John Tukey,
> Leo Breiman, George Box, and others wrote eloquently about this decades ago.
> And Breiman's random forest modeling procedure explicitly abandoned efforts
> to build simply interpretable models (from which one might infer causality)
> in favor of building better interpolators, although assessment of "variable
> importance" does try to recover some of that interpretability (however, no
> guarantees are given).
> 
> HTH. And contrary views welcome, as always.
> 
> Cheers to all,
> 
> Bert Gunter
> Genentech Nonclinical Biostatistics
>  
>  
> -----Original Message-----
> From: r-help-bounces at r-project.org On Behalf Of Frank E Harrell Jr
> Sent: Thursday, April 01, 2010 5:02 AM
> To: vibha patel
> Cc: r-help at r-project.org
> Subject: Re: [R] fitness of regression tree: how to measure???
> 
> vibha patel wrote:
> > Hello,
> > 
> > I'm using the rpart function to create regression trees.
> > How do I measure the fitness of a regression tree?
> > 
> > thanks n Regards,
> > Vibha
> 
> If the sample size is less than 20,000, assume that the tree is a
> somewhat arbitrary representation of the relationships in the data and
> that the form of the tree will not replicate in future datasets.
> 
> Frank
> 
> -- 
> Frank E Harrell Jr   Professor and Chairman        School of Medicine
>                       Department of Biostatistics   Vanderbilt University
> 
> ______________________________________________
> R-help at r-project.org mailing list
> 
> PLEASE do read the posting guide 
> and provide commented, minimal, self-contained, reproducible code.