[R] predict function type class vs. prob

Sat Sep 23 21:24:18 CEST 2023

That's embarrassing. Apologies for the garbled HTML posting. I'll see if 
this is more readable:

On 9/23/23 05:30, Rui Barradas wrote:
> At 11:12 on 22/09/2023, Milbert, Sabine (LGL) wrote:
>> Dear R Help Team,
>>
>> My research group and I use R scripts for our multivariate data 
>> screening routines. During routine use, we encountered some 
>> inconsistencies within the predict() function of the R Stats Package. 

In addition to Rui's correction to this misstatement, the caret package 
is really a meta package that attempts to implement an umbrella 
framework for a vast array of tools from a wide variety of sources. It 
is an immense effort but not really a part of the core R project. The 
correct place to file issues is found in the DESCRIPTION file:

URL: https://github.com/topepo/caret/
BugReports: https://github.com/topepo/caret/issues

If you use `str` on an object constructed with caret, you discover 
that the `predict` function actually used is not in the main workspace 
but is embedded in the fit object itself. I think this holds generally 
across the caret universe, so I expect that your fit objects can be 
examined in the same way for the code that predict.train will use. Your 
description of your analysis methods was rather incomplete, so after my 
code demonstration I will append a list of the "svm" methods that you 
might have specified. (Note that I do not see a caret "weights" 
hyper-parameter for the "svmLinear" method, which actually uses code 
from pkg:kernlab.)

library(caret)
svmFit <- train(Species ~ ., data = iris, method = "svmLinear",
                  trControl = trainControl(method = "cv"))

class(svmFit)
#[1] "train"         "train.formula"
str(predict(svmFit))
# Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
str(svmFit)
#---screen output-------------
List of 24
  $ method      : chr "svmLinear"
  $ modelInfo   :List of 13
   ..$ label     : chr "Support Vector Machines with Linear Kernel"
   ..$ library   : chr "kernlab"
   ..$ type      : chr [1:2] "Regression" "Classification"
   ..$ parameters:'data.frame':    1 obs. of  3 variables:
   .. ..$ parameter: chr "C"
   .. ..$ class    : chr "numeric"
   .. ..$ label    : chr "Cost"
   ..$ grid      :function (x, y, len = NULL, search = "grid")
   ..$ loop      : NULL
   ..$ fit       :function (x, y, wts, param, lev, last, classProbs, ...)
   ..$ predict   :function (modelFit, newdata, submodels = NULL)
   ..$ prob      :function (modelFit, newdata, submodels = NULL)
   ..$ predictors:function (x, ...)
   ..$ tags      : chr [1:5] "Kernel Method" "Support Vector Machines" "Linear Regression" "Linear Classifier" ...
   ..$ levels    :function (x)
   ..$ sort      :function (x)
  $ modelType   : chr "Classification"
#  ---- large amount of screen output omitted------
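
A shorter route to the same information, offered only as a sketch: 
caret's getModelInfo() returns the same kind of list without fitting 
anything, so the embedded functions can be listed by name, and the 
available "svm" methods can be pulled from the model registry. The 
object names `info` and `svm_methods` are mine, purely for illustration.

```r
library(caret)

# Look up the model definition by exact name (regex = FALSE avoids also
# matching svmLinear2, svmLinear3, ...). The result is a named list.
info <- getModelInfo("svmLinear", regex = FALSE)[["svmLinear"]]

names(info)           # includes "fit", "predict", "prob", ...
class(info$predict)   # the embedded prediction code is an ordinary function

# All registered caret methods whose name starts with "svm"
svm_methods <- grep("^svm", names(getModelInfo()), value = TRUE)
svm_methods
```

Printing `info$predict` or `info$prob` then shows the code for that 
method, which is the same thing stored in the fitted object.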

# note that the class of svmFit$modelInfo$predict is 'function'
# and that its code applies at least to this particular svm method,
# of which there are about 10!

svmFit$modelInfo$predict

#---- screen output ------
function (modelFit, newdata, submodels = NULL)
{
     svmPred <- function(obj, x) {
         hasPM <- !is.null(unlist(obj@prob.model))
         if (hasPM) {
             pred <- kernlab::lev(obj)[apply(kernlab::predict(obj,
                 x, type = "probabilities"), 1, which.max)]
         }
         else pred <- kernlab::predict(obj, x)
         pred
     }
     out <- try(svmPred(modelFit, newdata), silent = TRUE)
     if (is.character(kernlab::lev(modelFit))) {
         if (class(out)[1] == "try-error") {
             warning("kernlab class prediction calculations failed; returning NAs")
             out <- rep("", nrow(newdata))
             out[seq(along = out)] <- NA
         }
     }
     else {
         if (class(out)[1] == "try-error") {
             warning("kernlab prediction calculations failed; returning NAs")
             out <- rep(NA, nrow(newdata))
         }
     }
     if (is.matrix(out))
         out <- out[, 1]
     out
}
<bytecode: 0x561277d4ec50>
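
As I read the code above, when the kernlab fit carries a probability 
model the class is taken with which.max over the predicted 
probabilities, so for this method I would expect the two kinds of 
prediction to agree. Here is a minimal check, assuming the model was 
trained with classProbs = TRUE (your actual settings were not shown, so 
that is a guess on my part). Note that predict.train takes type = "raw" 
or type = "prob". The names svmFitP, cls, prb, cls_from_prb and mismatch 
are mine.

```r
library(caret)

set.seed(1)  # arbitrary seed, only so that the resampling is repeatable

# classProbs = TRUE makes caret ask kernlab to fit a probability model
svmFitP <- train(Species ~ ., data = iris, method = "svmLinear",
                 trControl = trainControl(method = "cv", classProbs = TRUE))

cls <- predict(svmFitP, newdata = iris, type = "raw")   # factor of classes
prb <- predict(svmFitP, newdata = iris, type = "prob")  # one column per class

# Class implied by the largest probability in each row
cls_from_prb <- factor(
  colnames(prb)[max.col(as.matrix(prb), ties.method = "first")],
  levels = levels(cls))

# Row numbers (if any) where class and probability predictions disagree
mismatch <- which(cls != cls_from_prb)
mismatch
prb[mismatch, , drop = FALSE]
```

If `mismatch` is not empty for your own model and data, those rows 
together with the output of svmFit$modelInfo$predict would make a 
useful reproducible example for the caret issues page.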

-- 
David

>> Through internal research, we were unable to find the reason for this
>> and have decided to contact your help team with the following issue:
>>
>> The predict() function is used once to predict the class membership
>> of a new sample (type = "class") on a trained linear SVM model for
>> distinguishing two classes (using the caret package). It is then used
>> to also examine the probability of class membership (type = "prob").
>> Both are then presented in an R shiny output. Within the routine, we
>> noticed two samples (out of 100+) where the class prediction and
>> probability prediction did not match. The prediction probabilities of
>> one class (52%) did not match the class membership within the predict
>> function. We use the same seed and the discrepancy is reproducible in
>> this sample. The same problem did not occur in other trained models
>> (lda, random forest, radial SVM...).

Support Vector Machines with Boundrange String Kernel (method = 
'svmBoundrangeString')