[R] Analyzing Poor Performance Using naiveBayes()

Patrick Connolly p_connolly at slingshot.co.nz
Sat Sep 15 03:51:07 CEST 2012


On Thu, 09-Aug-2012 at 03:40PM -0700, Kirk Fleming wrote:

|> My data is 50,000 instances of about 200 predictor values, and for
|> all 50,000 examples I have the actual class labels (binary). The
|> data is quite unbalanced with about 10% or less of the examples
|> having a positive outcome and the remainder, of course,
|> negative. Nothing suggests the data has any order, and it doesn't
|> appear to have any, so I've pulled the first 30,000 examples to use
|> as training data, reserving the remainder for test data.
|> 
|> There are actually 3 distinct sets of class labels associated with
|> the predictor data, and I've built 3 distinct models. When each
|> model is used in predict() with the training data and true class
|> labels, I get AUC values of 0.95, 0.98 and 0.98 for the 3
|> classifier problems.

I don't know which package your naiveBayes() comes from, so I can't
check it, but my experience with boosted regression trees might be
useful: I saw AUC values fairly similar to yours with only a tenth as
many instances as you have.
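For what it's worth, this is roughly how the validation set entered my
boosted regression tree fits.  A hedged sketch using the gbm package:
`train.fraction` holds out the tail of the data as an internal
validation set, and gbm.perf(..., method = "test") picks the iteration
that minimises the validation deviance.  The data frame and tuning
values here are made up purely for illustration.

```r
library(gbm)

## Toy stand-in for real data: 10 numeric predictors, 0/1 response.
## Shuffle rows first, because train.fraction takes the *first*
## 75% of rows as training data and the rest as the validation set.
set.seed(1)
dat <- data.frame(matrix(rnorm(5000 * 10), ncol = 10))
dat$y <- rbinom(5000, 1, plogis(dat$X1))
dat <- dat[sample(nrow(dat)), ]

fit <- gbm(y ~ ., data = dat,
           distribution = "bernoulli",
           n.trees = 2000,
           interaction.depth = 3,
           shrinkage = 0.01,
           train.fraction = 0.75)   # last 25% held out for validation

## Number of trees at the minimum of the validation deviance
best.iter <- gbm.perf(fit, method = "test")
```

Stopping at best.iter rather than the full n.trees is what kept my
training and test AUCs close together.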

If naiveBayes has the ability to use a validation set, I think you'll
find it makes a huge difference.  In my case, it brought the training
AUC down to something like 0.85, but the test AUC was only slightly
lower, say 0.81.

Try reserving about 20-25% of your training data as a validation set,
then calculate your AUC on the combined training and validation data.
It will probably drop somewhat, but your test AUC should look much
better.
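If it helps, the split I mean might look something like the sketch
below.  I'm assuming naiveBayes() is the one in the e1071 package and
using ROCR for the AUC; the data frame and object names are just
placeholders.

```r
library(e1071)
library(ROCR)

## Toy stand-in for the real training set: 5 numeric predictors
## and a factor response 'y'
set.seed(1)
train <- data.frame(matrix(rnorm(3000 * 5), ncol = 5))
train$y <- factor(rbinom(3000, 1, plogis(train$X1)))

## Hold back about 25% of the training rows as a validation set
idx   <- sample(nrow(train), 0.75 * nrow(train))
fit   <- naiveBayes(y ~ ., data = train[idx, ])
valid <- train[-idx, ]

## type = "raw" gives the posterior probability of each class;
## take the column for the positive class
p <- predict(fit, valid, type = "raw")[, "1"]

## AUC on the held-out validation rows
pred <- prediction(p, valid$y)
performance(pred, "auc")@y.values[[1]]
```

The gap between the AUC on train[idx, ] and on valid is the thing to
watch; if it's large, the training AUCs you quoted are optimistic.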

I'd be interested to know what you discover.


-- 
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.   
   ___    Patrick Connolly   
 {~._.~}                   Great minds discuss ideas    
 _( Y )_  	         Average minds discuss events 
(:_~*~_:)                  Small minds discuss people  
 (_)-(_)  	                      ..... Eleanor Roosevelt
	  
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
