[R] SVM accuracy question

Riccardo G-Mail ric.romoli at gmail.com
Tue Sep 27 17:20:15 CEST 2011


On 27/09/11 01:58, R. Michael Weylandt wrote:
> Why exactly do you want to "stabilize" your results?
>
> If it's in preparation for publication/classroom demo/etc., certainly
> resetting the seed before each run (and hence getting the same sample()
> output) will make your results exactly reproducible. However, if you are
> looking for a clearer picture of the true efficacy of your svm and
> there's no real underlying order to the data set (i.e., not a time
> series), then a straight sample() seems better to me.
>
> I'm not particularly well read on the svm literature, but it sounds like
> you are worried by widely varying performance of the svm itself. If
> that's the case, it seems (to me at least) that there are certain data
> points that are strongly informative and it might be a more interesting
> question to look into which ones those are.
>
> I guess my answer, as a total non-savant in the field, is that it
> depends on your goal: repeated runs with sample will give you more
> information about the strength of the svm while setting the seed will
> give you reproducibility. Importance sampling might be of interest,
> particularly if it could be tied to the information content of each data
> point, and a quick skim of the MC variance reduction literature might
> just provide some fun insights.
>
> I'm not entirely sure how you mean to bootstrap the act of setting the
> seed (a randomly set seed seems to be the same as not setting a seed at
> all) but that might give you a nice middle ground.
>
> Sorry this can't be of more help,
>
> Michael
>
> On Mon, Sep 26, 2011 at 6:32 PM, Riccardo G-Mail <ric.romoli at gmail.com> wrote:
>
>     Hi, I'm working with support vector machines for classification,
>     and I have a question about the accuracy of the predictions.
>
>     I divided my data set into a training set (1/3 of the entire data
>     set) and a test set (2/3 of the data set) using the "sample"
>     function. Each time I fit the svm model I obtain a different
>     result, depending on the output of the "sample" function. I would
>     like to "stabilize" the performance of my analysis. To do this I
>     used the "set.seed" function. Is there a better way to do this?
>     Should I perform a bootstrap on my work-flow (sample and svm)?
>
>     Here is an example of my workflow:
>     ### not to run
>     library(e1071)
>     index <- 1:nrow(myData)
>     set.seed(23)
>     testindex <- sample(index, trunc(length(index)/3))
>     testset <- myData[testindex, ]
>     trainset <- myData[-testindex, ]
>
>     ## tune cost and gamma on the training set (the grid is only an example)
>     tuned <- tune.svm(Factor ~ ., data = trainset,
>                       gamma = 10^(-3:0), cost = 10^(0:3))
>     svm.model <- svm(Factor ~ ., data = trainset,
>                      cost = tuned$best.parameters$cost,
>                      gamma = tuned$best.parameters$gamma, cross = 10)
>     summary(svm.model)
>     svm.pred <- predict(svm.model, testset)
>     mean(svm.pred == testset$Factor)   # accuracy on the held-out test set
>
>     Best
>     Riccardo
>
>     ________________________________________________
>     R-help at r-project.org mailing list
>     https://stat.ethz.ch/mailman/listinfo/r-help
>     PLEASE do read the posting guide
>     http://www.R-project.org/posting-guide.html
>     and provide commented, minimal, self-contained, reproducible code.
>
>
Thanks for your suggestion. I agree with you about the uselessness of
set.seed inside a bootstrap; the very idea of the bootstrap excludes
set.seed. In my mind, the bootstrap could allow me to understand the
distribution of the prediction accuracy of the model. My doubt stems
from the fact that I'm not a statistician.
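
Just to make the idea concrete, here is a rough sketch of what I mean
(not tested; it assumes the e1071 package and, as in my first mail, a
data frame myData with the class labels in the factor column Factor;
the cost/gamma values and the number of replicates are only
placeholders): repeat the random train/test split many times, refit
the svm each time, and look at the spread of the test-set accuracies.

## not run: distribution of test-set accuracy over repeated random splits
library(e1071)
n.rep <- 100                                 # number of replicates
acc <- numeric(n.rep)
for (i in seq_len(n.rep)) {
    testindex <- sample(1:nrow(myData), trunc(nrow(myData)/3))
    testset  <- myData[testindex, ]
    trainset <- myData[-testindex, ]
    fit <- svm(Factor ~ ., data = trainset,
               cost = 10, gamma = 0.1)       # placeholder tuning values
    pred <- predict(fit, testset)
    acc[i] <- mean(pred == testset$Factor)   # proportion correctly classified
}
summary(acc)   # spread of the prediction accuracy
hist(acc)

A proper bootstrap would instead resample the rows with replacement,
but the idea of looking at the whole distribution of accuracies rather
than at a single split is the same.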

Best


