[R] rpart minimum sample size

Frank E Harrell Jr f.harrell at vanderbilt.edu
Tue Feb 27 17:08:51 CET 2007


Amy Uhrin wrote:
> Is there an optimal / minimum sample size for attempting to construct a 
> classification tree using /rpart/?
> 
> I have 27 seagrass disturbance sites (boat groundings) that have been 
> monitored for a number of years.  The monitoring protocol for each site 
> is identical.  From the monitoring data, I am able to determine the 
> level of recovery that each site has experienced.  Recovery is our 
> categorical dependent variable with values of none, low, medium, high 
> which are based upon percent seagrass regrowth into the injury over 
> time.  I wish to be able to predict the level of recovery of future 
> vessel grounding sites based upon a number of categorical / continuous 
> predictor variables used here including (but not limited to) such 
> parameters as:  sediment grain size, wave exposure, original size 
> (volume) of the injury, injury age, injury location.
> 
> When I run /rpart/, the data is split into only two terminal nodes based 
> solely upon values of the original volume of each injury.  No other 
> predictor variables are considered, even though I have included about 
> six of them in the model.  When I remove volume from the model the same 
> thing happens but with injury area - two terminal nodes are formed based 
> upon area values and no other variables appear.  I was hoping that this 
> was a programming issue, me being a newbie and all, but I really think 
> I've got the code right.  Now I am beginning to wonder if my N is too 
> small for this method?
> 

In my experience N needs to be around 20,000 to get both good accuracy 
and replicability of patterns if the number of potential predictors is 
not tiny.  In general, the R^2 from rpart is not competitive with that 
from an intelligently fitted regression model.  It's just a difficult 
problem when relying on a single tree (hence the popularity of random 
forests, bagging, and boosting).
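
Apart from the statistical issue, there may be a purely mechanical reason a tree on 27 sites stops at two terminal nodes. This assumes the defaults of rpart.control() were in effect, which the original post does not say. The documented defaults are minsplit = 20 (a node needs at least 20 observations before a split is attempted) and minbucket = round(minsplit/3) = 7 (the smallest allowed terminal node). The sketch below is plain Python arithmetic, not rpart itself; the function names are made up for illustration, and it ignores the cp complexity threshold, which can only prune further.

```python
from functools import lru_cache

# Documented rpart.control() defaults (assumed to be what the poster used).
MINSPLIT = 20                    # min observations in a node to attempt a split
MINBUCKET = round(MINSPLIT / 3)  # min observations in a terminal node -> 7


def allowed_splits(n, minsplit=MINSPLIT, minbucket=MINBUCKET):
    """Child-size pairs (left, right) that the size limits permit for a node of n."""
    if n < minsplit:
        return []  # node too small: no split is attempted
    return [(left, n - left) for left in range(minbucket, n - minbucket + 1)]


@lru_cache(maxsize=None)
def max_leaves(n, minsplit=MINSPLIT, minbucket=MINBUCKET):
    """Largest number of terminal nodes the size limits allow (cp is ignored)."""
    splits = allowed_splits(n, minsplit, minbucket)
    if not splits:
        return 1
    return max(max_leaves(l, minsplit, minbucket) + max_leaves(r, minsplit, minbucket)
               for l, r in splits)


# First splits of 27 sites that leave a child big enough to be split again.
resplittable = [s for s in allowed_splits(27) if max(s) >= MINSPLIT]

print(len(allowed_splits(27)))  # 14 possible size pairs for the first split
print(resplittable)             # [(7, 20), (20, 7)]
print(max_leaves(27))           # 3
```

So under the defaults, a second split is possible only when the first split happens to be exactly 7 versus 20, and three terminal nodes is the ceiling. Passing smaller values through rpart.control(minsplit = ..., minbucket = ..., cp = ...) would let the tree grow further, but with N = 27 that produces a bigger tree, not a more trustworthy one.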

Frank
-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University
