[R] ROC curve in R

Frank E Harrell Jr f.harrell at vanderbilt.edu
Thu Jul 26 19:45:46 CEST 2007

Dylan Beaudette wrote:
> On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote:
>> Note that even though the ROC curve as a whole is an interesting
>> 'statistic' (its area is a linear translation of the
>> Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation
>> statistics), each individual point on it is an improper scoring rule,
>> i.e., a rule that is optimized by fitting an inappropriate model.  Using
>> curves to select cutoffs is a low-precision and arbitrary operation, and
>> the cutoffs do not replicate from study to study.  Probably the worst
>> problem with drawing an ROC curve is that it tempts analysts to try to
>> find cutoffs where none really exist, and it makes analysts ignore the
>> whole field of decision theory.
>> Frank Harrell
> Frank,
> This thread has caught my attention for a couple of reasons, possibly related to
> my novice-level experience.
> 1. In a logistic regression study, where I am predicting the probability of
> the response being 1 (for example), there exists a continuum of probability
> values and a finite number of {1,0} realities when I look either within the
> original data set or at a new 'verification' data set. I understand that
> drawing a line through the probabilities returned from the logistic
> regression is a loss of information, but there are times when a 'hard'
> decision requiring prediction of {1,0} is required. I have found that the
> ROCR package (not necessarily the ROC curve) can be useful in identifying the
> probability cutoff where accuracy is maximized. Is this an unreasonable way
> of using logistic regression as a predictor?

Logistic regression (with suitable attention to not assuming linearity 
and to avoiding overfitting) is a great way to estimate P[Y=1].  Given 
good predicted P[Y=1] and utilities (losses, costs) for incorrect 
positive and negative decisions, an optimal decision is one that 
optimizes expected utility.  The ROC curve does not play a direct role 
in this regard.  If per-subject utilities are not available, the analyst 
may make various assumptions about utilities (including the unreasonable 
but often used assumption that utilities do not vary over subjects) to 
find a cutoff on P[Y=1].  A very nice feature of P[Y=1] is that error 
probabilities are self-contained.  For example if P[Y=1] = .02 for a 
single subject and you predict Y=0, the probability of an error is .02 
by definition.  One doesn't need to compute an overall error probability 
over the whole distribution of subjects' risks.  If the cost of a false 
negative is C, the expected cost is .02*C in this example.
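
To make the arithmetic above concrete, here is a minimal sketch in Python
(not R) of choosing the decision with the lowest expected cost for a single
subject.  The cost values and the function names expected_costs and
best_decision are hypothetical, chosen only for illustration; real utilities
would have to be elicited for the problem at hand.

```python
def expected_costs(p, cost_fn, cost_fp):
    """Expected cost of each decision for one subject with risk p = P[Y=1]."""
    return {
        0: p * cost_fn,        # predict Y=0: an error occurs with probability p
        1: (1 - p) * cost_fp,  # predict Y=1: an error occurs with probability 1 - p
    }


def best_decision(p, cost_fn, cost_fp):
    """Return the decision (0 or 1) with the smaller expected cost."""
    costs = expected_costs(p, cost_fn, cost_fp)
    return min(costs, key=costs.get)


p = 0.02      # predicted P[Y=1] for this subject, as in the example above
C = 100.0     # hypothetical cost of a false negative
C_FP = 1.0    # hypothetical cost of a false positive

costs = expected_costs(p, cost_fn=C, cost_fp=C_FP)
print("expected cost if we predict Y=0:", costs[0])   # .02 * C
print("expected cost if we predict Y=1:", costs[1])
print("decision with lowest expected cost:", best_decision(p, C, C_FP))
```

Under these assumed costs the implied cutoff on P[Y=1] is
C_FP / (C_FP + C), which comes from the utilities and not from an ROC curve.
With different costs the same subject could warrant a different decision.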

> 2. The ROC curve can be a helpful way of communicating false positives / false 
> negatives to other users who are less familiar with the output and 
> interpretation of logistic regression. 

What is more useful than that is a rigorous calibration curve estimate 
to demonstrate the faithfulness of predicted P[Y=1] and a histogram 
showing the distribution of predicted P[Y=1].  Models that put a lot of 
predictions near 0 or 1 are the most discriminating.  Calibration curves 
and risk distributions are easier to explain than ROC curves.  Too often 
a statistician will solve for a cutoff on P[Y=1], imposing her own 
utility function without querying any subjects.
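
As a rough illustration of the two displays mentioned above, the following
Python sketch (not R) tabulates mean predicted risk against observed
frequency and prints a crude text histogram of the predictions.  The data
are simulated, and the equal-width binning is an assumption made for
brevity; it is a coarse stand-in for a smooth calibration curve estimate,
not a substitute for one.

```python
import random


def calibration_table(pred, y, n_bins=10):
    """Bin predictions into equal-width intervals and compare the mean
    predicted risk with the observed frequency of Y=1 in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, outcome in zip(pred, y):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, outcome))
    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        n = len(members)
        mean_pred = sum(p for p, _ in members) / n
        observed = sum(o for _, o in members) / n
        rows.append((idx / n_bins, (idx + 1) / n_bins, n, mean_pred, observed))
    return rows


# Simulated data: outcomes are drawn from the predicted risks themselves,
# so these hypothetical predictions are well calibrated by construction.
random.seed(1)
pred = [random.random() for _ in range(2000)]
y = [1 if random.random() < p else 0 for p in pred]

rows = calibration_table(pred, y)
for lower, upper, n, mean_pred, observed in rows:
    bar = "#" * (n // 10)  # text histogram of the risk distribution
    print(f"[{lower:.1f}, {upper:.1f})  n={n:4d}  "
          f"mean pred={mean_pred:.3f}  observed={observed:.3f}  {bar}")
```

Bins where "mean pred" and "observed" disagree point to miscalibration, and
the bar lengths show how many predictions fall near 0 or 1.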

> 3. I have been using the area under the ROC curve, Kendall's tau, and Cohen's
> kappa to evaluate the accuracy of a logistic regression based prediction, the
> last two statistics based on some probability cutoff identified beforehand.

ROC area (equiv. to Wilcoxon-Mann-Whitney and Somers' Dxy rank 
correlation between pred. P[Y=1] and Y) is a measure of pure 
discrimination, not a measure of accuracy per se.  Rank correlation 
(concordance) measures do not require the use of cutoffs.
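
The relationship above can be checked directly.  This Python sketch (not R)
computes the concordance probability c, which equals the ROC area, by
comparing every (Y=1, Y=0) pair of subjects, and then converts it to Somers'
Dxy via Dxy = 2 * (c - 0.5).  The six predictions are made-up toy values and
the function name concordance is hypothetical; no cutoff is used anywhere.

```python
def concordance(pred, y):
    """c-index: among all pairs of one subject with Y=1 and one with Y=0,
    the fraction in which the Y=1 subject has the higher predicted risk.
    Tied predictions count as one half."""
    pos = [p for p, outcome in zip(pred, y) if outcome == 1]
    neg = [p for p, outcome in zip(pred, y) if outcome == 0]
    if not pos or not neg:
        raise ValueError("need at least one subject with Y=1 and one with Y=0")
    score = 0.0
    for a in pos:
        for b in neg:
            if a > b:
                score += 1.0
            elif a == b:
                score += 0.5
    return score / (len(pos) * len(neg))


# Toy data, for illustration only
pred = [0.10, 0.40, 0.35, 0.80, 0.70, 0.20]
y = [0, 0, 1, 1, 1, 0]

c = concordance(pred, y)   # equals the area under the ROC curve
dxy = 2 * (c - 0.5)        # Somers' Dxy rank correlation
print("c (ROC area):", round(c, 4))
print("Somers' Dxy: ", round(dxy, 4))
```

The pairwise loop is quadratic in the number of subjects, which is fine for
a sketch; rank-based formulas give the same value more efficiently.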

> How does the topic of decision theory relate to some of the circumstances 
> described above? Is there a better way to do some of these things?

See above re: expected losses/utilities.

Good questions.

> Cheers,
> Dylan
>> gyadav at ccilindia.co.in wrote:
>>> http://search.r-project.org/cgi-bin/namazu.cgi?query=ROC&max=20&result=normal&sort=score&idxname=Rhelp02a&idxname=functions&idxname=docs
>>> There is a lot of help available. Trying help.search("ROC curve") gave:
>>> Help files with alias or concept or title matching 'ROC curve' using
>>> fuzzy matching:
>>> granulo(ade4)             Granulometric Curves
>>> plot.roc(analogue)        Plot ROC curves and associated diagnostics
>>> roc(analogue)             ROC curve analysis
>>> colAUC(caTools)           Column-wise Area Under ROC Curve (AUC)
>>> DProc(DPpackage)          Semiparametric Bayesian ROC curve analysis
>>> cv.enet(elasticnet)       Computes K-fold cross-validated error curve for elastic net
>>> ROC(Epi)                  Function to compute and draw ROC-curves.
>>> lroc(epicalc)             ROC curve
>>> cv.lars(lars)             Computes K-fold cross-validated error curve for lars
>>> roc.demo(TeachingDemos)   Demonstrate ROC curves by interactively building one
>>> Type 'help(FOO, package = PKG)' to inspect entry 'FOO(PKG) TITLE'.
>>> HTH. See the help pages and examples; those should suffice.
>>> Regards,
>>> Gaurav Yadav
>>> +++++++++++
>>> Assistant Manager, CCIL, Mumbai (India)
>>> Mob: +919821286118 Email: emailtogauravyadav at gmail.com
>>> Bhagavad Gita:  Man is made by his Belief, as He believes, so He is
>>> "Rithesh M. Mohan" <rithesh.m at brickworkindia.com>
>>> Sent by: r-help-bounces at stat.math.ethz.ch
>>> 07/26/2007 11:26 AM
>>> To: <R-help at stat.math.ethz.ch>
>>> cc:
>>> Subject: [R] ROC curve in R
>>> Hi,
>>> I need to build ROC curve in R, can you please provide data steps / code
>>> or guide me through it.
>>> Thanks and Regards
>>> Rithesh M Mohan
>> -
>> Frank E Harrell Jr   Professor and Chair           School of Medicine
>>                       Department of Biostatistics   Vanderbilt University
