[R] Can ROC be used as a metric for optimal model selection for randomForest?

Max Kuhn mxkuhn at gmail.com
Fri May 13 14:48:47 CEST 2011


Frank,

It depends on how you define "optimal". While I'm not a big fan of
using the area under the ROC to characterize performance, there are a
lot of times when likelihood measures are clearly sub-optimal in
performance. Using resampled accuracy (or Kappa) instead of deviance
(out-of-bag or not) is likely to produce more accurate models (not
shocking, right?).
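
For concreteness, here is a minimal sketch of the measures mentioned
above, computed in base R on made-up class probabilities. The vectors
y and p and the 0.5 cutoff are assumptions chosen for illustration;
they are not from the original question. The area under the ROC curve
is computed with the rank-sum (Mann-Whitney) identity.

```r
# Toy data (assumed for illustration): true classes and predicted P(class = 1)
y <- c(1, 1, 1, 1, 0, 0, 0, 0)
p <- c(0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1)

# Hard class predictions at an assumed 0.5 cutoff
pred <- as.integer(p >= 0.5)
tab <- table(truth = factor(y, levels = 0:1),
             pred  = factor(pred, levels = 0:1))

# Accuracy: proportion of observed agreement
acc <- sum(diag(tab)) / sum(tab)

# Cohen's Kappa: observed agreement corrected for chance agreement
pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
kap <- (acc - pe) / (1 - pe)

# Area under the ROC curve via the rank-sum identity
n1  <- sum(y == 1)
n0  <- sum(y == 0)
auc <- (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Binomial deviance: -2 * log-likelihood of the predicted probabilities
dev <- -2 * sum(y * log(p) + (1 - y) * log(1 - p))

print(c(accuracy = acc, kappa = kap, auc = auc, deviance = dev))
```

Accuracy and Kappa depend only on which side of the cutoff each
probability falls, the AUC depends only on the ordering of the
probabilities, and the deviance depends on the probability values
themselves, so the four numbers need not rank candidate models the
same way.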

The best example is determining the number of boosting iterations.
From Friedman (2001): ``[...] degrading the likelihood by overfitting
actually improves misclassification error rates. Although perhaps
counterintuitive, this is not a contradiction; likelihood and error
rate measure different aspects of fit quality.''
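
A toy illustration of that point in base R (this is not Friedman's
experiment; the two probability vectors below are invented, and the
0.5 cutoff is an assumption). Hypothetical model B is far more
confident than model A and makes one very confident mistake, so its
deviance is worse even though it misclassifies fewer cases.

```r
# Invented example: six cases with true classes and two sets of predicted P(class = 1)
y   <- c(1, 1, 1, 0, 0, 0)
p_a <- c(0.60, 0.60, 0.45, 0.40, 0.40, 0.55)    # cautious model, two errors
p_b <- c(0.99, 0.99, 0.99, 0.01, 0.01, 0.999)   # confident model, one error

# Binomial deviance: -2 * log-likelihood (lower is better)
deviance_of <- function(y, p) {
  -2 * sum(y * log(p) + (1 - y) * log(1 - p))
}

# Misclassification rate at an assumed 0.5 cutoff (lower is better)
error_of <- function(y, p, cutoff = 0.5) {
  mean(as.integer(p >= cutoff) != y)
}

res <- rbind(
  A = c(deviance = deviance_of(y, p_a), error = error_of(y, p_a)),
  B = c(deviance = deviance_of(y, p_b), error = error_of(y, p_b))
)
print(res)
```

Selecting on deviance would pick A here, while selecting on error
rate would pick B, which is the sense in which the two criteria
measure different aspects of fit.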

My argument here assumes that you are fitting a model for the purposes
of prediction rather than interpretation. This particular case
involves random forests, so I'm hoping that statistical inference is
not the goal.


Ref: Friedman. Greedy function approximation: a gradient boosting
machine. Annals of Statistics (2001) pp. 1189-1232


Thanks,

Max

On Fri, May 13, 2011 at 8:11 AM, Frank Harrell <f.harrell at vanderbilt.edu> wrote:
> Using anything other than deviance (or likelihood) as the objective function
> will result in a suboptimal model.
> Frank
>
> -----
> Frank Harrell
> Department of Biostatistics, Vanderbilt University





