[R] sample size > 20K? Was: fitness of regression tree: how to measure???

Thu Apr 1 21:32:54 CEST 2010

Good comments, Bert.  Just 2 points to add: People rely a lot on the tree 
structure found by recursive partitioning, so the structure needs to be 
stable.  This requires a huge sample size.  Second, recursive 
partitioning is not competitive with other methods in terms of 
predictive discrimination unless the sample size is so large that the 
tree doesn't need to be pruned upon cross-validation.
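
To make the pruning point concrete, here is a minimal sketch using rpart 
on simulated data.  The data-generating model, the seed, cp = 0.001 and 
the helper n_leaves() are illustrative assumptions only.  It grows a 
deliberately large tree, picks the complexity parameter with the lowest 
cross-validated error from the cptable, and prunes back to it.

```r
library(rpart)

set.seed(1)
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
d$y <- 2 * d$x1 + sin(6 * d$x2) + rnorm(n)   # x3 is pure noise (assumed model)

# Grow a deliberately large tree; rpart does 10-fold cross-validation (xval = 10)
fit <- rpart(y ~ ., data = d, method = "anova",
             control = rpart.control(cp = 0.001, xval = 10))

# cptable columns: CP, nsplit, rel error, xerror, xstd
cp_tab <- fit$cptable
best   <- which.min(cp_tab[, "xerror"])
pruned <- prune(fit, cp = cp_tab[best, "CP"])

# Hypothetical helper: count terminal nodes
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

cat("leaves before pruning:", n_leaves(fit), "\n")
cat("leaves after pruning: ", n_leaves(pruned), "\n")
cat("cross-validated relative error:", cp_tab[best, "xerror"], "\n")
```

The xerror column is relative to the root-node error, so 1 - xerror is 
roughly a cross-validated R^2.  If the pruned tree has far fewer leaves 
than the unpruned one, the sample is not large enough to support the 
bigger tree.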

Frank

Bert Gunter wrote:
> Since Frank has made this somewhat cryptic remark (sample size > 20K)
> several times now, perhaps I can add a few words of (what I hope is) further
> clarification.
> 
> Despite any claims to the contrary, **all** statistical (i.e. empirical)
> modeling procedures are just data interpolators: that is, all that they can
> claim to do is produce reasonable predictions of what may be expected within
> the extent of the data. The quality of the model is judged by the goodness
> of fit/prediction over this extent. Ergo the standard textbook caveats about
> the dangers of extrapolation when using fitted models for prediction. Note,
> btw, the contrast to "mechanistic" models, which typically **are** assessed
> by how well they **extrapolate** beyond current data. For example, Newton's
> apple to the planets. They are often "validated" by their ability to "work"
> in circumstances (or scales) much different than those from which they were
> derived.
> 
> So statistical models are just fancy "prediction engines." In particular,
> there is no guarantee that they provide any meaningful assessment of
> variable importance: how predictors causally relate to the response.
> Obviously, empirical modeling can often be useful for this purpose,
> especially in well-designed studies and experiments, but there's no
> guarantee: it's an "accidental" byproduct of effective prediction.
> 
> This is particularly true for happenstance (un-designed) data and
> non-parametric models like regression/classification trees. Typically, there
> are many alternative models (trees) that give essentially the same quality
> of prediction. You can see this empirically by removing a modest random
> subset of the data and re-fitting. You should not be surprised to see the
> fitted model -- the tree topology -- change quite radically. HOWEVER, the
> predictions of the models within the extent of the data will be quite
> similar to the original results. Frank's point is that unless the data set
> is quite large and the predictive relationships quite strong -- which
> usually implies parsimony -- this is exactly what one should expect. Thus it
> is critical not to over-interpret the particular model one gets, i.e. to
> infer causality from the model (tree) structure.
> 
> Incidentally, there is nothing new or radical in this; indeed, John Tukey,
> Leo Breiman, George Box, and others wrote eloquently about this decades ago.
> And Breiman's random forest modeling procedure explicitly abandoned efforts
> to build simply interpretable models (from which one might infer causality)
> in favor of building better interpolators, although assessment of "variable
> importance" does try to recover some of that interpretability (however, no
> guarantees are given).
> 
> HTH. And contrary views welcome, as always.
> 
> Cheers to all,
> 
> Bert Gunter
> Genentech Nonclinical Biostatistics
>  
>  
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of Frank E Harrell Jr
> Sent: Thursday, April 01, 2010 5:02 AM
> To: vibha patel
> Cc: r-help at r-project.org
> Subject: Re: [R] fitness of regression tree: how to measure???
> 
> vibha patel wrote:
>> Hello,
>>
>> I'm using the rpart function to create regression trees.
>> How do I measure the fitness of a regression tree?
>>
>> Thanks and regards,
>> Vibha
> 
> If the sample size is less than 20,000, assume that the tree is a 
> somewhat arbitrary representation of the relationships in the data and 
> that the form of the tree will not replicate in future datasets.
> 
> Frank
> 
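
The subsample-and-refit check Bert describes above can be sketched along 
these lines.  Everything here is an assumption made for illustration: the 
simulated data (including x5, a near-duplicate of x1 that gives the tree 
interchangeable splits to choose from), the seed, the 10% holdout, and 
the helper splits().  It refits after dropping a random subset, then 
compares the split variables and the predictions of the two trees.

```r
library(rpart)

set.seed(2)
n <- 1000
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n))
d$x5 <- d$x1 + rnorm(n, sd = 0.1)   # near-duplicate of x1 (assumed)
d$y  <- 2 * d$x1 + d$x2 + rnorm(n)

fit_full <- rpart(y ~ ., data = d, method = "anova")

# Drop a random 10% of the rows and refit
keep    <- sample(n, size = round(0.9 * n))
fit_sub <- rpart(y ~ ., data = d[keep, ], method = "anova")

# Hypothetical helper: split variables in the order they appear in the frame
splits <- function(tree) {
  as.character(tree$frame$var[tree$frame$var != "<leaf>"])
}
print(splits(fit_full))
print(splits(fit_sub))

# Compare the two trees' predictions over the same rows
p_full <- predict(fit_full, newdata = d)
p_sub  <- predict(fit_sub,  newdata = d)
cat("correlation of predictions:", cor(p_full, p_sub), "\n")
```

Repeating this over many random subsets (or bootstrap resamples) and 
tabulating how often each split variable appears gives a rough picture of 
how stable the structure is, separately from how stable the predictions 
are.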

-- 
Frank E Harrell Jr   Professor and Chairman        School of Medicine
                      Department of Biostatistics   Vanderbilt University