[R] Inconsistent results between caret+kernlab versions

Max Kuhn mxkuhn at gmail.com
Fri Nov 15 21:59:51 CET 2013


Or not!

The issue is with kernlab.

Background: SVM models do not naturally produce class probabilities. A
secondary model (Platt's method) is fit to the raw model output: a
logistic function translates the raw SVM output into probability-like
numbers (i.e. between 0 and 1, summing to one). In ksvm(), you need to
use the option prob.model = TRUE to get that second model.
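The shape of that secondary model can be sketched in a few lines of R. The coefficients A and B below are made up for illustration; ksvm() estimates them internally from the training data when prob.model = TRUE:

```r
## Platt's method: map a raw SVM decision value f to a probability
## via a fitted sigmoid 1 / (1 + exp(A * f + B)).
## A and B are illustrative here, not values kernlab would estimate.
platt <- function(f, A = -2, B = 0) {
  1 / (1 + exp(A * f + B))
}

f <- c(-2, -0.5, 0, 0.5, 2)  # raw decision values
round(platt(f), 3)           # monotone map into (0, 1)
```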

I discovered some time ago that there can be a discrepancy between the
predicted classes that come directly from the SVM model and those
derived by taking the class with the largest class probability. This is
most likely due to natural error in the secondary probability model and
should not be unexpected.
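You can count how often the two rules disagree. A toy sketch in R (the class names mirror yours, but the raw predictions and probabilities are made up to stand in for predict(fit, X) and predict(fit, X, type = "probabilities")):

```r
## Compare a model's own class predictions with the classes implied by
## a class-probability matrix. All values below are illustrative.
raw_class <- c("O32078", "O32057", "O32059")  # as from predict(fit, X)
probs <- rbind(c(O32057 = 0.44, O32059 = 0.20, O32078 = 0.36),
               c(O32057 = 0.50, O32059 = 0.30, O32078 = 0.20),
               c(O32057 = 0.20, O32059 = 0.45, O32078 = 0.35))

## class with the largest probability in each row
prob_class <- colnames(probs)[max.col(probs)]

sum(raw_class != prob_class)  # rows where the two rules disagree
```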

That is the case for your data. If you use the same tuning parameters
as those suggested by train() and go straight to ksvm():

> newSVM <- ksvm(x = as.matrix(df[,-1]),
+                y = df[,1],
+                kernel = rbfdot(sigma = svm.m1$bestTune$.sigma),
+                C = svm.m1$bestTune$.C,
+                prob.model = TRUE)
>
> predict(newSVM, df[43,-1])
[1] O32078
10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676
> predict(newSVM, df[43,-1], type = "probabilities")
         O27479     O31403    O32057    O32059     O32060    O32078
[1,] 0.08791826 0.05911645 0.2424997 0.1036943 0.06968587 0.1648394
         O32089     O32663     O32668     O32676
[1,] 0.04890477 0.05210836 0.09838892 0.07284396

Note that, based on the probability model, the class with the largest
probability is O32057 (p = 0.24) while the basic SVM model predicts
O32078 (p = 0.16).

Somebody (maybe me) saw this discrepancy, and that led me to follow this rule:

if (prob.model == TRUE) use the class with the maximum probability
   else use the class prediction from ksvm().
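In R terms, the rule amounts to something like the following sketch (illustrative only, not caret's actual source code):

```r
## Illustrative version of the rule; not caret's implementation.
## 'fit' is a fitted ksvm object; 'has_prob_model' says whether it was
## trained with prob.model = TRUE.
pick_class <- function(fit, newdata, has_prob_model) {
  if (has_prob_model) {
    ## use the class with the largest Platt probability
    p <- predict(fit, newdata, type = "probabilities")
    factor(colnames(p)[max.col(p)], levels = colnames(p))
  } else {
    ## fall back to the SVM's own class prediction
    predict(fit, newdata)
  }
}
```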

Therefore:

> predict(svm.m1, df[43,-1])
[1] O32057
10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676

That change occurred between the two caret versions that you tested with.

(On a side note, this can also occur with ksvm() and rpart() if
cost-sensitive training is used, because the class designation takes
the costs into account but the class probability predictions do not. I
alerted both package maintainers to the issue some time ago.)
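For reference, cost-sensitive training in ksvm() looks like this sketch (the weights and the use of iris are illustrative):

```r
library(kernlab)

## Up-weight one class via class.weights. The decision rule accounts
## for these costs, but the Platt probability model does not, so the
## predicted classes and the argmax-probability classes can disagree.
fit <- ksvm(Species ~ ., data = iris,
            class.weights = c(setosa = 1, versicolor = 5, virginica = 1),
            prob.model = TRUE)

head(predict(fit, iris))                          # cost-aware classes
head(predict(fit, iris, type = "probabilities"))  # cost-blind probabilities
```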

HTH,

Max

On Fri, Nov 15, 2013 at 1:56 PM, Max Kuhn <mxkuhn at gmail.com> wrote:
> I've looked into this a bit and the issue seems to be with caret. I've
> been looking at the svn check-ins and nothing stands out to me as the
> issue so far. The final models that are generated are the same and
> I'll try to figure out the difference.
>
> Two small notes:
>
> 1) you should set the seed to ensure reproducibility.
> 2) you really shouldn't use character strings that are all numbers as
> factor levels with caret when you want class probabilities. It should
> give you a warning about this.
>
> Max
>
> On Thu, Nov 14, 2013 at 7:31 PM, Andrew Digby <andrewdigby at mac.com> wrote:
>>
>> I'm using caret to assess classifier performance (and it's great!). However, I've found that my results differ between R2.* and R3.* - reported accuracies are reduced dramatically. I suspect that a code change to kernlab ksvm may be responsible (see version 5.16-24 here: http://cran.r-project.org/web/packages/caret/news.html). I get very different results between caret_5.15-61 + kernlab_0.9-17 and caret_5.17-7 + kernlab_0.9-19 (see below).
>>
>> Can anyone please shed any light on this?
>>
>> Thanks very much!
>>
>>
>> ### To replicate:
>>
>> require(repmis)  # For downloading from https
>> df <- source_data('https://dl.dropboxusercontent.com/u/47973221/data.csv', sep=',')
>> require(caret)
>> svm.m1 <- train(df[,-1],df[,1],method='svmRadial',metric='Kappa',tuneLength=5,trControl=trainControl(method='repeatedcv', number=10, repeats=10, classProbs=TRUE))
>> svm.m1
>> sessionInfo()
>>
>> ### Results - R2.15.2
>>
>>> svm.m1
>> 1241 samples
>>    7 predictors
>>   10 classes: ‘O27479’, ‘O31403’, ‘O32057’, ‘O32059’, ‘O32060’, ‘O32078’, ‘O32089’, ‘O32663’, ‘O32668’, ‘O32676’
>>
>> No pre-processing
>> Resampling: Cross-Validation (10 fold, repeated 10 times)
>>
>> Summary of sample sizes: 1116, 1116, 1114, 1118, 1118, 1119, ...
>>
>> Resampling results across tuning parameters:
>>
>>   C     Accuracy  Kappa  Accuracy SD  Kappa SD
>>   0.25  0.684     0.63   0.0353       0.0416
>>   0.5   0.729     0.685  0.0379       0.0445
>>   1     0.756     0.716  0.0357       0.0418
>>
>> Tuning parameter ‘sigma’ was held constant at a value of 0.247
>> Kappa was used to select the optimal model using  the largest value.
>> The final values used for the model were C = 1 and sigma = 0.247.
>>> sessionInfo()
>> R version 2.15.2 (2012-10-26)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>>  [1] e1071_1.6-1     class_7.3-5     kernlab_0.9-17  repmis_0.2.4    caret_5.15-61   reshape2_1.2.2  plyr_1.8        lattice_0.20-10 foreach_1.4.0   cluster_1.14.3
>>
>> loaded via a namespace (and not attached):
>>  [1] codetools_0.2-8 compiler_2.15.2 digest_0.6.0    evaluate_0.4.3  formatR_0.7     grid_2.15.2     httr_0.2        iterators_1.0.6 knitr_1.1       RCurl_1.95-4.1  stringr_0.6.2   tools_2.15.2
>>
>> ### Results - R3.0.2
>>
>>> require(caret)
>>> svm.m1 <- train(df[,-1],df[,1],method='svmRadial',metric='Kappa',tuneLength=5,trControl=trainControl(method='repeatedcv', number=10, repeats=10, classProbs=TRUE))
>> Loading required package: class
>> Warning messages:
>> 1: closing unused connection 4 (https://dl.dropboxusercontent.com/u/47973221/df.Rdata)
>> 2: executing %dopar% sequentially: no parallel backend registered
>>> svm.m1
>> 1241 samples
>>    7 predictors
>>   10 classes: ‘O27479’, ‘O31403’, ‘O32057’, ‘O32059’, ‘O32060’, ‘O32078’, ‘O32089’, ‘O32663’, ‘O32668’, ‘O32676’
>>
>> No pre-processing
>> Resampling: Cross-Validation (10 fold, repeated 10 times)
>>
>> Summary of sample sizes: 1118, 1117, 1115, 1117, 1116, 1118, ...
>>
>> Resampling results across tuning parameters:
>>
>>   C     Accuracy  Kappa  Accuracy SD  Kappa SD
>>   0.25  0.372     0.278  0.033        0.0371
>>   0.5   0.39      0.297  0.0317       0.0358
>>   1     0.399     0.307  0.0289       0.0323
>>
>> Tuning parameter ‘sigma’ was held constant at a value of 0.2148907
>> Kappa was used to select the optimal model using  the largest value.
>> The final values used for the model were C = 1 and sigma = 0.215.
>>> sessionInfo()
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>>  [1] e1071_1.6-1     class_7.3-9     kernlab_0.9-19  repmis_0.2.6.2  caret_5.17-7    reshape2_1.2.2  plyr_1.8        lattice_0.20-24 foreach_1.4.1   cluster_1.14.4
>>
>> loaded via a namespace (and not attached):
>> [1] codetools_0.2-8 compiler_3.0.2  digest_0.6.3    grid_3.0.2      httr_0.2        iterators_1.0.6 RCurl_1.95-4.1  stringr_0.6.2   tools_3.0.2
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
>
> Max



-- 

Max


