[R] svm

Sat Jan 9 21:48:49 CET 2010

Hi,

On Fri, Jan 8, 2010 at 11:57 AM, Amy Hessen <amy_4_5_84 at hotmail.com> wrote:
> Hi Steve,
>
> Thank you very much for your reply. Your code is more readable and obvious than mine…

No problem.

> Could you please help me in these questions?:
>
> 1) “Formula” is an alternative to “y” parameter in SVM. is it correct?

No, that's not correct.

There are two svm methods: one that takes a formula object
(svm.formula), and one that takes an x matrix and a y vector
(svm.default). svm.formula is called when the first argument in your
"svm(...)" call is a formula. It simply parses the formula, turns your
data object into an x matrix and a y vector, and then calls
svm.default with those parameters. I usually prefer to skip the
formula and provide the x and y objects directly.

Load the e1071 library and look at the source code:

R> library(e1071)
R> e1071:::svm.formula

You'll see what I mean.
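
To make that a bit more concrete, here is a rough sketch, using only
base R, of the kind of conversion a formula interface performs before
handing things to an x/y function. This is not the actual e1071 code;
the toy data frame `d` and its column names are made up for
illustration.

```r
# Toy data (hypothetical): class label in the first column, two features.
d <- data.frame(label = factor(c("a", "b", "a", "b")),
                f1 = c(1, 2, 3, 4),
                f2 = c(0.5, 0.1, 0.9, 0.3))

# "label ~ ." means: label is the response, every other column is a predictor.
mf <- model.frame(label ~ ., data = d)

# Pull the response (y) out of the model frame.
y <- model.response(mf)

# Build the predictor matrix (x) and drop the intercept column.
x <- model.matrix(attr(mf, "terms"), mf)
x <- x[, colnames(x) != "(Intercept)", drop = FALSE]

print(colnames(x))
print(y)
```

With x and y in hand, a call like svm(x, y, ...) goes straight to
svm.default.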

> 2) I forgot to remove the “class label” from the dataset besides I gave the
> program the class label in formula parameter but the program works! Could
> you please clarify this point to me?

The author of the e1071 package did you a favor. The predict.svm
function checks whether your svm object was built using the formula
interface. If so, it looks for your label column in the data you are
trying to predict on and ignores it.

Look at the function's source code (e.g., type e1071:::predict.svm at
the R prompt), and look for the call to the delete.response function.
You can also look at the help in ?delete.response.
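
As a small illustration of what delete.response does, here is a sketch
in base R only. It is not taken from predict.svm itself; the toy data
frame `d` and the variable names are assumptions for the example.

```r
# Toy data (hypothetical): class label in the first column, two features.
d <- data.frame(label = factor(c("a", "b", "a", "b")),
                f1 = c(1, 2, 3, 4),
                f2 = c(0.5, 0.1, 0.9, 0.3))

# Terms of the training formula, with "." expanded to the feature columns.
tt <- terms(label ~ ., data = d)

# Strip the response, leaving only the predictor side of the formula.
tt_x <- delete.response(tt)

# Building a model frame from the stripped terms uses only the predictors,
# whether or not the label column is present in the new data.
with_label    <- model.frame(tt_x, d)
without_label <- model.frame(tt_x, d[, -1])

print(names(with_label))
print(names(without_label))
```

Both model frames contain only the feature columns, which is why
leaving the label in the data you predict on does no harm.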

-steve

>> Date: Wed, 6 Jan 2010 18:44:13 -0500
>> Subject: Re: [R] svm
>> From: mailinglist.honeypot at gmail.com
>> To: amy_4_5_84 at hotmail.com
>> CC: r-help at r-project.org
>>
>> Hi Amy,
>>
>> On Wed, Jan 6, 2010 at 4:33 PM, Amy Hessen <amy_4_5_84 at hotmail.com> wrote:
>> > Hi Steve,
>> >
>> > Thank you very much for your reply.
>> >
>> > I’m trying to do something systematic/general in the program so that I
>> > can try different datasets without changing much in the program (without
>> > knowing the name of the class label, which differs from one dataset to
>> > another…)
>> >
>> > Could you please tell me your opinion about this code:-
>> >
>> > library(e1071)
>> >
>> > mydata<-read.delim("the_whole_dataset.txt")
>> >
>> > class_label <- names(mydata)[1]   # I’ll always put the class label in the first column.
>> >
>> > myformula <- formula(paste(class_label,"~ ."))
>> >
>> > x <- subset(mydata, select = - mydata[, 1])
>> >
>> > mymodel<-(svm(myformula, x, cross=3))
>> >
>> > summary(model)
>> >
>> > ################
>>
>> Since you're not doing anything funky with the formula, a preference
>> of mine is to just skip this way of calling SVM and go "straight" to
>> the svm(x,y,...) method:
>>
>> R> mydata <- as.matrix(read.delim("the_whole_dataset.txt"))
>> R> train.x <- mydata[,-1]
>> R> train.y <- mydata[,1]
>>
>> R> mymodel <- svm(train.x, train.y, cross=3, type="C-classification")
>> ## or
>> R> mymodel <- svm(train.x, train.y, cross=3, type="eps-regression")
>>
>> As an aside, I also like to be explicit about the type="" parameter to
>> tell the SVM what I want it to do (regression or classification). If
>> it's not specified, svm picks one based on whether your y vector is a
>> factor (it does classification) or not (it does regression).
>>
>> > Do I have to do the same steps with the testing set? i.e. the testing set
>> > must not contain the label too? But contains the same structure as the
>> > training set? Is it correct?
>>
>> I guess you'll want to report your accuracy/MSE/something on your
>> model for your testing set? Just load the data in the same way, then
>> use `predict` to calculate the metric you're after. You'll need the
>> labels for your test data to do that, though, e.g.:
>>
>> testdata <- as.matrix(read.delim('testdata.txt'))
>> test.x <- testdata[,-1]
>> test.y <- testdata[,1]
>> preds <- predict(mymodel, test.x)
>>
>> Let's assume you're doing classification, so let's report the accuracy:
>>
>> acc <- sum(preds == test.y) / length(test.y)
>>
>> Does that help?
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> | Memorial Sloan-Kettering Cancer Center
>> | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact