[R] Can ROC be used as a metric for optimal model selection for randomForest?

Frank Harrell f.harrell at vanderbilt.edu
Fri May 13 23:29:52 CEST 2011


Thanks for your note, Max.  Part of the picture is how the predictions would
be used.  If they are used in a "forced choice" way (quite a shame, because
the best decision is often no decision: get more data), things are different.
If there are gray zones, or predicted probabilities are of interest, then I'd
avoid ROC area as a measure and use penalized likelihood (speaking in crude
generality).
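
A minimal sketch of why the choice matters when predicted probabilities are
of interest, written in Python rather than R for illustration only.  The
outcomes, the probabilities, and the helper names auc and log_loss are all
made up for this example, and it uses the plain (unpenalized) log-likelihood,
not a penalized one.  ROC area depends only on the ordering of the
predictions, so a monotone distortion that makes them overconfident leaves it
unchanged, while the log-likelihood gets worse.

```python
import math

def auc(y_true, p):
    """ROC area as the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs ranked correctly, ties counting one half."""
    pos = [pi for yi, pi in zip(y_true, p) if yi == 1]
    neg = [pi for yi, pi in zip(y_true, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, p, eps=1e-12):
    """Mean negative Bernoulli log-likelihood, i.e. deviance / (2n)."""
    total = 0.0
    for yi, pi in zip(y_true, p):
        pi = min(max(pi, eps), 1 - eps)  # guard against log(0)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total / len(y_true)

# Hypothetical outcomes and predicted probabilities (illustration only).
y = [0, 0, 0, 1, 0, 1, 1, 1]
p_a = [0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90]

# Same ordering, pushed toward 0 and 1 by a strictly increasing transform
# (the logit multiplied by 4), which mimics an overconfident model.
p_b = [pi ** 4 / (pi ** 4 + (1 - pi) ** 4) for pi in p_a]

auc_a, auc_b = auc(y, p_a), auc(y, p_b)
ll_a, ll_b = log_loss(y, p_a), log_loss(y, p_b)

print("ROC area:", auc_a, auc_b)
print("log loss:", ll_a, ll_b)
```

Both sets of predictions get the same ROC area because every
positive/negative pair is ordered the same way, but the second set has the
larger log loss because its two misordered cases are predicted with much more
confidence.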

Frank


Max Kuhn wrote:
> 
> Frank,
> 
> It depends on how you define "optimal". While I'm not a big fan of
> using the area under the ROC to characterize performance, there are a
> lot of times when likelihood measures are clearly sub-optimal in
> performance. Using resampled accuracy (or Kappa) instead of deviance
> (out-of-bag or not) is likely to produce more accurate models (not
> shocking, right?).
> 
> The best example is determining the number of boosting iterations.
> From Friedman (2001): ``[...] degrading the likelihood by overfitting
> actually improves misclassification error rates. Although perhaps
> counterintuitive, this is not a contradiction; likelihood and error
> rate measure different aspects of fit quality.''
> 
> My argument here assumes that you are fitting a model for the purposes
> of prediction rather than interpretation. This particular case
> involves random forests, so I'm hoping that statistical inference is
> not the goal.
> 
> 
> Ref: Friedman. Greedy function approximation: a gradient boosting
> machine. Annals of Statistics (2001) pp. 1189-1232
> 
> 
> Thanks,
> 
> Max
> 
> On Fri, May 13, 2011 at 8:11 AM, Frank Harrell
> <f.harrell at vanderbilt.edu> wrote:
>> Using anything other than deviance (or likelihood) as the objective function
>> will result in a suboptimal model.
>> Frank
>>
>> -----
>> Frank Harrell
>> Department of Biostatistics, Vanderbilt University
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/Can-ROC-be-used-as-a-metric-for-optimal-model-selection-for-randomForest-tp3519003p3520043.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 


-----
Frank Harrell
Department of Biostatistics, Vanderbilt University
--
View this message in context: http://r.789695.n4.nabble.com/Can-ROC-be-used-as-a-metric-for-optimal-model-selection-for-randomForest-tp3519003p3521274.html
Sent from the R help mailing list archive at Nabble.com.


