[BioC] Support vector regression

Mon Mar 31 18:27:16 CEST 2014

Hi,

On Mon, Mar 31, 2014 at 9:06 AM, Paul [guest] <guest at bioconductor.org> wrote:
>
> For convenience sake, I use the example data to ask the question. I use QSAR.XLS [http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/qsar.zip]
>
> Considering the donors from the dataset as predictor variables and Activity as the response variable, I would like to do a support vector regression using both linear and non-linear kernels.
>
>  In my case, I would like to find which of the predictors (out of the 20 donors) best explain the activity (response) and did the following:
>
>
>          fit <- svm(activity ~ ., data=qsar,kernel='linear',type="eps-regression")
>          Call:
>          svm(formula = activity ~ ., data = qsar, kernel = "linear", type = "eps-regression")
>
>
>          Parameters:
>          SVM-Type:  eps-regression
>          SVM-Kernel:  linear
>          cost:  1
>          gamma:  0.04347826
>          epsilon:  0.1
>          Number of Support Vectors:  66
>
> How can I now determine which of the 20 predictors best explain the activity, and how do I get the R-squared values?

SVMs aren't the easiest models to do this with. The trained model is
(or should be) sparse in *example* space, so you know which examples
contribute most to the fitted function, but you are left to reverse
engineer how the features in each example are responsible for that
(given the kernel you use).

With a linear kernel, you can take the values in the weight vector W
of the SVM as a feature-ranking type of approach, but with non-linear
kernels this gets complicated fast.
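
For the linear-kernel case, here is a minimal sketch of what I mean.
It assumes e1071 and uses simulated data as a stand-in for your QSAR
table (the x1..x5 columns and the data frame `d` are made up for
illustration). Note that svm() scales the inputs by default, so the
weights are on the scaled predictors.

```r
library(e1071)

# Simulated stand-in for the QSAR data (hypothetical): 75 rows, 5 predictors,
# and only x1 and x2 actually drive the response.
set.seed(1)
n <- 75
X <- matrix(rnorm(n * 5), nrow = n, dimnames = list(NULL, paste0("x", 1:5)))
d <- data.frame(X, activity = 2 * X[, "x1"] - X[, "x2"] + rnorm(n, sd = 0.1))

fit <- svm(activity ~ ., data = d, kernel = "linear", type = "eps-regression")

# For a linear kernel, the weight vector is the coefficient-weighted sum of
# the support vectors (one weight per scaled predictor).
w <- drop(t(fit$coefs) %*% fit$SV)

# Rank predictors by absolute weight, largest first.
ranking <- sort(abs(w), decreasing = TRUE)
print(ranking)
```

Treat this as a rough ranking only: correlated predictors can share or
trade weight between them.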

You might try using a method that enforces sparsity in the feature
space: try the glmnet package. You could also try penalizedSVM, but (I
believe) the last time I checked you could only use linear kernel
(although I could easily be mistaken).
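
A rough sketch of the glmnet route, again on simulated data (the
donor1..donor20 names and the response are invented here, not taken
from your file). It fits a lasso and lets cross-validation pick the
penalty; predictors with non-zero coefficients are the ones the model
keeps.

```r
library(glmnet)

# Simulated stand-in (hypothetical): 20 predictors, only the first two matter.
set.seed(1)
n <- 75
X <- matrix(rnorm(n * 20), nrow = n,
            dimnames = list(NULL, paste0("donor", 1:20)))
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n, sd = 0.5)

# alpha = 1 is the lasso penalty; cv.glmnet chooses lambda by cross-validation.
cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 5)

# Coefficients at the more conservative "lambda.1se" choice.
# Predictors with a zero coefficient have been dropped by the model.
b <- as.matrix(coef(cvfit, s = "lambda.1se"))
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
print(selected)
```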

Also, looking at your results, you have 66 support vectors out of a
dataset of 75 examples (so the model is not sparse with respect to the
number of examples you trained on). Typically you like to see
considerably fewer support vectors than training examples as a sign
that the model fits well.
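
If you want to check this quickly, something along these lines should
work. It is a sketch on simulated data (the data frame `d` is a
made-up stand-in); tot.nSV is the support vector count stored in the
object returned by e1071's svm().

```r
library(e1071)

# Simulated stand-in for the QSAR data (hypothetical).
set.seed(1)
n <- 75
X <- matrix(rnorm(n * 5), nrow = n, dimnames = list(NULL, paste0("x", 1:5)))
d <- data.frame(X, activity = 2 * X[, "x1"] - X[, "x2"] + rnorm(n, sd = 0.5))

fit <- svm(activity ~ ., data = d, kernel = "linear", type = "eps-regression")

# Fraction of training examples that ended up as support vectors.
sv_fraction <- fit$tot.nSV / nrow(d)
print(sv_fraction)
```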

But the number of SVs isn't the *real* thing you are interested in:
you'd rather do some cross-validation to check that the model
actually generalizes (i.e. how well does it predict on held-out
examples). Once you get something that looks promising, I'd then spend
time trying to figure out how to extract features from it.
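
As a rough illustration, a hand-rolled 5-fold cross-validation could
look like the sketch below. The data are simulated and the names (`d`,
`fold`, `cv_r2`) are mine; the R-squared here is computed on held-out
predictions, which is one reasonable way to get the number you asked
about.

```r
library(e1071)

# Simulated stand-in for the QSAR data (hypothetical).
set.seed(1)
n <- 75
X <- matrix(rnorm(n * 5), nrow = n, dimnames = list(NULL, paste0("x", 1:5)))
d <- data.frame(X, activity = 2 * X[, "x1"] - X[, "x2"] + rnorm(n, sd = 0.5))

# Randomly assign each row to one of k folds.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
pred <- numeric(n)

for (i in seq_len(k)) {
  held_out <- fold == i
  # Train on everything except fold i, then predict fold i.
  fit_i <- svm(activity ~ ., data = d[!held_out, ], kernel = "linear",
               type = "eps-regression")
  pred[held_out] <- predict(fit_i, newdata = d[held_out, ])
}

# Held-out error and R-squared (1 - residual SS / total SS).
cv_rmse <- sqrt(mean((d$activity - pred)^2))
cv_r2 <- 1 - sum((d$activity - pred)^2) /
  sum((d$activity - mean(d$activity))^2)
print(c(rmse = cv_rmse, r2 = cv_r2))
```

The same loop works for a non-linear kernel by changing the kernel
argument, which makes it easy to compare the two on equal footing.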

I see in your dataset you've annotated some rows as train and test but
you're not using that information just yet.

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech