[R] SVM probability output variation

Wed Oct 21 19:05:37 CEST 2009

Hi again, and thank you Steve for your reply!

> Hi Anders,
> 
> On Oct 21, 2009, at 8:49 AM, Anders Carlsson wrote:
> 
> > Dear R:ers,
> >
> > I'm using the svm from the e1071 package to train a model with the
> > option "probabilities = TRUE". I then use "predict" with
> > "probabilities = TRUE" and get the probabilities for the data point
> > belonging to either class. So far all is well.
> >
> > My question is why I get different results each time I train the
> > model, although I use exactly the same data. The prediction seems to
> > be reproducible, but if I re-train the model, the probabilities vary
> > somewhat.
> >
> > Here, I have trained a model on exactly the same data five times.
> > When predicting using the different models, this is how the
> > probabilities vary:
> 
> I'm not sure I'm following the example you're giving and the scenario
> you are describing.

I think you got it!

> 
> > probabilities
> > Grp.0        Grp.1
> > 0.7077155    0.2922845
> > 0.7938782    0.2061218
> > 0.8178833    0.1821167
> > 0.7122203    0.2877797
> 
> This seems fine to me: it looks like the probabilities of class
> membership for 4 examples (Note that Grp.0 + Grp.1 = 1).
> 

Yes, within each run all was OK, but I was surprised that the probabilities varied so much between runs.

> 
> > How can the predictions using the same training and test data vary
> > so much?
> 
> I'm trying the code below several times (taken from the example), and
> the probabilities calculated from the call to prediction don't change
> much at all:
> 
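> R> library(e1071)
> R> data(iris)
> R> attach(iris)
> R> x <- subset(iris, select = -Species)  # predictors
> R> y <- Species                          # class labels
> 
> R> model <- svm(x, y, probability=TRUE)
> R> predict(model, x, probability=TRUE)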
> 
> To be fair, the probabilities aren't exactly the same, but the
> difference between two runs is really small:
> 
> R> model <- svm(x, y, probability=TRUE)
> R> a <- predict(model, x, probability=TRUE)
> 
> R> model <- svm(x, y, probability=TRUE)
> R> b <- predict(model, x, probability=TRUE)
> 
> R> mean(abs(attr(a, 'probabilities') - attr(b, 'probabilities')))
> [1] 0.003215959
> 
> Is this what you were talking about, or ... ?

Yes, exactly that. In your example, though, the variation seems to be a lot smaller. I'm guessing that has to do with the data.

If I instead output the decision values, the whole procedure is fully reproducible, i.e. the exact same values are returned when I retrain the model. 

I have no idea how the probabilities are calculated, but it seems to be in this step that the differences arise. In my case, I feel a bit hesitant to use them when they differ that much between runs (15% or so)... 

In case it matters, I use a linear kernel and don't tune the model in any way.
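
A minimal sketch of the comparison described above, assuming the iris data from the quoted example stands in for the real data and that a linear kernel is used. `fit_once` is a hypothetical helper, not part of e1071. As far as I understand, libsvm derives the probabilities by fitting a sigmoid to decision values from an internal cross-validation with randomly assigned folds, which would be consistent with the decision values being stable while the probabilities are not. How large the difference is, and whether it shows up at all, may depend on the data and the e1071 version.

```r
library(e1071)

data(iris)
x <- subset(iris, select = -Species)  # predictors
y <- iris$Species                     # class labels

# Hypothetical helper: train once, then return both the decision values
# and the class probabilities predicted for the training data.
fit_once <- function() {
  model <- svm(x, y, kernel = "linear", probability = TRUE)
  pred  <- predict(model, x, decision.values = TRUE, probability = TRUE)
  list(dec  = attr(pred, "decision.values"),
       prob = attr(pred, "probabilities"))
}

run1 <- fit_once()
run2 <- fit_once()

# Largest absolute difference between the two runs.
max(abs(run1$dec  - run2$dec))   # decision values
max(abs(run1$prob - run2$prob))  # probabilities
```

If the second number is clearly larger than the first, the variation comes from the probability step rather than from the SVM fit itself.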

Thanks again!

/Anders

> 
> -steve
> 
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>    |  Memorial Sloan-Kettering Cancer Center
>    |  Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact