[R] Proper / Improper scoring Rules

Daniel Malter daniel at umd.edu
Thu Aug 13 08:15:26 CEST 2009


results=c(0.31,0.36,0.33)
names(results)=c("y=good","y=better","y=best")
results

names(results)[results==max(results)]
which(names(results)==(names(results)[results==max(results)]))

More generally, however, avoid protected operators in your variable names
(like the equality sign)! Rather choose something like y.good, y.better,
y.best, or whatever you like as variable names.

HTH,
Daniel 


-------------------------
cuncta stricte discussurus
-------------------------

-----Ursprüngliche Nachricht-----
Von: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] Im
Auftrag von Donald Catanzaro, PhD
Gesendet: Thursday, August 13, 2009 2:04 AM
Cc: r-help at r-project.org
Betreff: Re: [R] Proper / Improper scoring Rules

Hi All,

I have done more background research (including Frank's book) so I feel that
my second question is answered.  However, as a novice R user I still have
the following problem, accessing the output of predict.  So simplifying my
question, using the example provided in the Design package
(http://lib.stat.cmu.edu/S/Harrell/help/Design/html/predict.lrm.html) I
might do something like:

> # See help for predict.Design for several binary logistic # regression 
> examples
> 
> # Examples of predictions from ordinal models
> set.seed(1)
> y <- factor(sample(1:3, 400, TRUE), 1:3, c('good','better','best'))
> x1 <- runif(400)
> x2 <- runif(400)
> f <- lrm(y ~ rcs(x1,4)*x2)
> predict(f, type="fitted.ind")[1:10,]   #gets Prob(better) and all others
      y=good  y=better    y=best
1  0.3124704 0.3631544 0.3243752
2  0.3676075 0.3594685 0.2729240
3  0.2198274 0.3437416 0.4364309
4  0.3063463 0.3629658 0.3306879
5  0.5171323 0.3136088 0.1692590
6  0.3050115 0.3629071 0.3320813
7  0.3532452 0.3612928 0.2854620
8  0.2933928 0.3621220 0.3444852
9  0.3068595 0.3629867 0.3301538
10 0.6214710 0.2612164 0.1173126
> d <- data.frame(x1=.5,x2=.5)
> predict(f, d, type="fitted")        # Prob(Y>=j) for new observation
y>=better   y>=best 
0.6906593 0.3275849 
> predict(f, d, type="fitted.ind")    # Prob(Y=j)
   y=good  y=better    y=best 
0.3093407 0.3630744 0.3275849 


So now if I wanted to do

> out <- predict(f, d, type="fitted.ind")>

> out

   y=good  y=better    y=best 

0.3093407 0.3630744 0.3275849 

> out$"y=better"

Error in out$"y=better" : $ operator is invalid for atomic vectors

> 


y=better is the max, so how do I create something that says that ? 
(which is not exactly what I want to do but close enough to help me figure
out what R code I need to accomplish the task)

I can push the predictions out to a vector:

out.vector <- as.vector(predict(f, d, type="fitted.ind"))

> out.vector

[1] 0.3093407 0.3630744 0.3275849


which gets me part of the way because I can find out max(out.vector) but I
still need to know what column the max is in. I think the problem is that I
don't know how to manipulate data frames and vectors in R and need some
guidance

-Don 

Don Catanzaro, PhD                  
Landscape Ecologist
dgcatanzaro at gmail.com
16144 Sigmond Lane
Lowell, AR 72745
479-751-3616



Frank E Harrell Jr wrote:
> Donald Catanzaro, PhD wrote:
>> Hi All,
>>
>> I am working on some ordinal logistic regresssions using LRM in the 
>> Design package.  My response variable has three categories (1,2,3) 
>> and after using the creating my model and using a call to predict 
>> some values and I wanted to use a simple .5 cut-off to classify my 
>> probabilities into the categories.
>>
>> I had two questions:
>>
>> a)  first, I am having trouble directly accessing the probabilities 
>> which may have more to do with my lack of experience with R
>>
>> For instance, my calls
>>
>>  >ologit.three.NoPerFor <- lrm(Threshold.Three ~ TECI , data=CLD,
>> na.action=na.pass)
>>  >CLD$Threshold.Predict.Three.NoPerFor<-
>> predict(ologit.three.NoPerFor, newdata=CLD, type="fitted.ind")
>>  
>> >CLD$Threshold.Predict.Three.NoPerFor.Cats[CLD$Threshold.Predict.Thre
>> e.NoPerFor.Threshold.Three=1
>>  > .5] <- 1
>> Error: unexpected '=' in
>>
"CLD$Threshold.Predict.Three.NoPerFor.Cats[CLD$Threshold.Predict.Three.NoPer
For.Threshold.Three=" 
>>
>>  >
>>  >
>>
>> produce an error message and it seems as R does not like the equal 
>> sign at all.  So how does one access the probabilities so I can 
>> classify them into the categories of 1,2,3 so I can look at 
>> performance of my model ?
>
> use == to check equality
>
>>
>> b)  which leads me to my next question.  I thought that simply 
>> calculating the percent correct off of my predictions would be 
>> sufficient to look at performance but since my question is very much 
>> in line with this thread 
>> http://tolstoy.newcastle.edu.au/R/e4/help/08/04/8987.html I am not so 
>> sure anymore.  I am afraid I did not understand Frank Harrell's last 
>> suggestion regarding improper scoring rule - can someone point me to 
>> some internet resources that I might be able to review to see why my 
>> approach would not be valid ?
>
> Percent correct will give you misleading answers and is game-able.  It 
> is also ultra-high-variance.  Though not a truly proper scoring rule, 
> Somers' Dxy rank correlation (generalization of ROC area) is helpful.
> Better still: use the log-likelihood and related quantities (deviance, 
> adequacy index as described in my book).
>
> Frank
>
>>
>>
>
>

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




More information about the R-help mailing list