[R] Statistical significance of a classifier

Liaw, Andy andy_liaw at merck.com
Fri Aug 5 22:06:05 CEST 2005


> From: Martin C. Martin
> 
> Hi,
> 
> I have a bunch of data points x from two classes A & B, and I'm
> creating a classifier.  So I have a function f(x) which estimates the
> probability that x is in class A.  (I have an equal number of examples
> of each, so p(class) = 0.5.)
> 
> One way of seeing how well this does is to compute the error rate on
> the test set, i.e. if f(x) > 0.5 call it A, and see how many times I
> misclassify an item.  That's what MASS does.  But we should

Surely you mean `99% of data miners/machine learners' rather than `MASS'?

> be able to do better: misclassifying should be more of a problem if
> the regression is confident than if it isn't.
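
Right, and a proper scoring rule makes that precise: the log-loss
(negative log-likelihood) punishes a confident wrong prediction far
more than a marginal one, while the 0/1 error rate charges every
mistake the same.  A minimal sketch in base R, where `p' and `y' are
invented names (predicted P(A) on the test set, and the true labels
coded 1 for A, 0 for B):

## toy stand-in for test-set output: p = predicted P(A), y = truth (1 = A)
set.seed(1)
y <- rep(c(1, 0), each = 50)             # balanced classes, so p(class) = 0.5
p <- plogis(2 * (y - 0.5) + rnorm(100))  # a noisy but informative classifier

## 0/1 error rate: every mistake costs the same
mean((p > 0.5) != (y == 1))

## log-loss: confident mistakes cost much more than marginal ones
-mean(y * log(p) + (1 - y) * log(1 - p))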
> 
> How can I show that my f(x) = P(x is in class A) does better 
> than chance?

It depends on what you mean by `better'.  For some problems, people are
perfectly happy with the misclassification rate.  For others, the
estimated probabilities count for a lot more.  One possibility is to look
at the ROC curve.  Another is to look at the calibration curve (see MASS,
the book).
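
For instance, both curves take only a few lines of base R, and an exact
binomial test on the error count addresses your original `better than
chance' question directly.  This is just a sketch using the invented
`p' and `y' from the toy example above, not any particular package's
interface:

## same invented toy data as above
set.seed(1)
y <- rep(c(1, 0), each = 50)
p <- plogis(2 * (y - 0.5) + rnorm(100))

## ROC curve: sweep the threshold over the observed probabilities
th  <- sort(unique(p), decreasing = TRUE)
tpr <- sapply(th, function(t) mean(p[y == 1] >= t))  # sensitivity
fpr <- sapply(th, function(t) mean(p[y == 0] >= t))  # 1 - specificity
plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate", main = "ROC")
abline(0, 1, lty = 2)                                # chance line

## calibration curve: bin the predictions, compare mean p with the
## observed fraction of class A in each bin
bins <- cut(p, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
plot(tapply(p, bins, mean), tapply(y, bins, mean),
     xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Mean predicted P(A)", ylab = "Observed fraction of A")
abline(0, 1, lty = 2)                                # perfect calibration

## `better than chance': exact binomial test of the error count
## against the 50% error rate a coin flip would give
binom.test(sum((p > 0.5) != (y == 1)), length(y),
           p = 0.5, alternative = "less")

A ROC curve well above the diagonal, a calibration plot hugging it, and
a small p-value each say `better than chance' in a different sense.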

Andy

> Thanks,
> Martin