[R] gbm for cost-sensitive binary classification?

Tang Yuchun tyczjs at yahoo.com
Wed Jun 17 21:05:52 CEST 2009


(Sorry for the repost; this copy is in plain text.)

I recently used gbm for a binary classification problem. As expected, it gives very good results, measured by area under the ROC curve with 7-fold cross-validation. However, the application (malware detection) is cost-sensitive: an FP (classifying a clean sample as dirty) is much worse than an FN (missing a dirty sample). I would like to bias the gbm model toward a very low FP rate. The metric I use is the area under the ROC curve truncated at a 1% FP rate; the higher the better.
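
For concreteness, the metric described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the function name partial.auc, the argument names, and the toy data are hypothetical and are not taken from the script actually used. Labels are assumed to be coded 1 = dirty and 0 = clean, higher scores are assumed to mean "more likely dirty", and the area is left unnormalized, so its maximum equals the FP-rate cut-off.

```r
# Area under the ROC curve restricted to FP rates in [0, max.fpr].
# score: numeric, higher = more likely dirty; label: 1 = dirty, 0 = clean.
# (Hypothetical helper, written for illustration only.)
partial.auc <- function(score, label, max.fpr = 0.01) {
  ord   <- order(score, decreasing = TRUE)
  score <- score[ord]
  label <- label[ord]
  # One ROC point per distinct score threshold (ties share a point).
  last <- !duplicated(score, fromLast = TRUE)
  tpr  <- c(0, cumsum(label == 1)[last] / sum(label == 1))
  fpr  <- c(0, cumsum(label == 0)[last] / sum(label == 0))
  # Keep the ROC points inside the region of interest.
  k <- max(which(fpr <= max.fpr))
  x <- fpr[1:k]
  y <- tpr[1:k]
  # If the curve crosses the cut-off between two points, interpolate linearly.
  if (fpr[k] < max.fpr && k < length(fpr)) {
    frac <- (max.fpr - fpr[k]) / (fpr[k + 1] - fpr[k])
    x <- c(x, max.fpr)
    y <- c(y, tpr[k] + frac * (tpr[k + 1] - tpr[k]))
  }
  # Trapezoidal rule over the clipped curve.
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# Toy data: a perfect ranking reaches the maximum possible value, max.fpr.
toy.score <- c(0.9, 0.8, 0.2, 0.1)
toy.label <- c(1, 1, 0, 0)
partial.auc(toy.score, toy.label, max.fpr = 0.25)
```

With real data, score would be the predictions on the held-out fold and max.fpr would stay at its 0.01 default. Dividing the result by max.fpr gives a value between 0 and 1 that is easier to compare across cut-offs.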

For this purpose I tried both weighting and sampling strategies, but so far neither works as I expected. gbm accepts a weight vector, so I tried overweighting the clean side (weight 10 for each clean sample and 1 for each dirty sample), but I don't see a big difference from the gbm model fitted without weights. I also tried feeding an imbalanced dataset into gbm (10 times as many clean samples as dirty samples), but that did not work either.
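
The weighting scheme just described can be written out as a small sketch. The toy label vector and the name clean.weight are made up for illustration, and the coding 0 = clean, 1 = dirty is an assumption; tr.y and tr.w mirror the names used in the calls further down.

```r
# Hypothetical toy labels: 0 = clean, 1 = dirty (assumed coding).
tr.y <- c(0, 0, 1, 0, 1, 0)

# Weight 10 for each clean sample and 1 for each dirty sample.
clean.weight <- 10
tr.w <- ifelse(tr.y == 0, clean.weight, 1)

# Share of the total weight carried by each class, as a sanity check
# that the clean side really dominates the weighted loss.
class.share <- tapply(tr.w, tr.y, sum) / sum(tr.w)
print(class.share)
```

A vector built this way has one entry per training row and can be passed as w = tr.w to gbm.fit or as weights = tr.w to gbm.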

I think I am missing something here. I would very much appreciate it if anyone could advise me on how to implement cost-sensitive classification with gbm. Below is the gbm modeling script I used.

model.gbm <- gbm.fit(tr[, 1:DIM], tr.y,
                     offset = NULL,
                     misc = NULL,
                     distribution = "bernoulli",
                     w = tr.w,
                     var.monotone = NULL,
                     n.trees = NTREE,
                     interaction.depth = TREEDEPTH,
                     n.minobsinnode = 10,
                     shrinkage = 0.05,
                     bag.fraction = BAGRATIO,
                     train.fraction = 1.0,
                     keep.data = TRUE,
                     verbose = TRUE,
                     var.names = NULL,
                     response.name = NULL)

or 

model.gbm <- gbm(tr.y ~ .,
                 distribution = "bernoulli",
                 data = data.frame(cbind(tr[, 1:DIM], tr.y)),
                 weights = tr.w,
                 var.monotone = NULL,
                 n.trees = NTREE,
                 interaction.depth = TREEDEPTH,
                 n.minobsinnode = 10,
                 shrinkage = 0.05,
                 bag.fraction = 0.5,
                 train.fraction = 1.0,
                 cv.folds = 5,
                 keep.data = TRUE,
                 verbose = TRUE)


 
------------------------------------
Yuchun Tang, Ph.D.
Principal Engineer, Lead
 
McAfee, Inc.
4800 North Point Parkway
Suite 300
Alpharetta,
GA  30022
 
Main:     678.904.9153
www.mcafee.com
www.trustedsource.org



