[BioC] How to do k-fold validation using SVM

Ramon Diaz rdiaz@cnio.es
Mon, 27 Jan 2003 15:20:37 +0100


On Monday 27 January 2003 14:55, Stephen Henderson wrote:
> That looks good--thanks
> A question and a suggestion follow.
>
> 1. You say you used a linear kernel--Did you find this to be best after
> testing and/or optimising the other kernels?

Not really; at least, not after any extensive testing. But with the data I am 
working with, linear seemed better than radial, and the people I am working with 
preferred linear to radial (and I do too: I don't really understand linear 
SVMs, much less other kernels).
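
If you do want to compare kernels, something along these lines could work. This 
is only a minimal sketch using the built-in "cross" argument of svm() from 
e1071; "x.genes" and "y.class" are placeholders for your expression matrix 
(subjects in rows, genes in columns) and class labels, and note that it does 
not redo the gene selection within each fold, so it only gives a rough idea:

library(e1071)

## Rough kernel comparison via svm()'s own k-fold cross-validation.
## 'x.genes' and 'y.class' are hypothetical objects standing in for your data.
for (kern in c("linear", "radial")) {
    fit <- svm(x.genes, factor(y.class), kernel = kern, cross = 10)
    cat(kern, ": 10-fold CV accuracy =", fit$tot.accuracy, "%\n")
}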

>
> 2. A set of wrapper functions (for multtest, ipred, and e1071) that
> consistently interface the affy data object with a few good feature
> selection methods and classification methods might be a useful new addition
> to BioC.

I guess that is an invitation for me to write that. I actually like the 
suggestion (and I think it should be fairly easy to get my boss to regard it 
as a good idea too), so I'll probably go for it (but it will be a few months 
before I can do it). [A disclaimer, though: I am not a particularly gifted R 
programmer; for example, I've managed to avoid learning anything about S4 
classes, and I have no idea how Sweave works. I suppose this is an opportunity 
to learn some of this stuff.]
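
Just to make the idea concrete, here is a rough (untested) sketch of the kind 
of wrapper I have in mind, assuming an exprSet with the class labels in a 
phenoData column (the column name "Class" below is hypothetical) and the 
cross.valid.svm function quoted further down:

library(Biobase)
library(multtest)
library(e1071)

svm.cv.exprset <- function(eset, class.column, knumber = 10, size = 200) {
    ## exprs() stores genes in rows, so transpose to subjects-in-rows
    x <- t(exprs(eset))
    ## class labels taken from the named phenoData column
    y <- pData(eset)[[class.column]]
    cross.valid.svm(x, y, knumber = knumber, size = size)
}

## e.g.:  svm.cv.exprset(my.eset, "Class", knumber = 10, size = 50)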

Best,




>
> Stephen Henderson
>
>
> -----Original Message-----
> From: Ramon Diaz [mailto:rdiaz@cnio.es]
> Sent: Monday, January 27, 2003 1:38 PM
> To: bioconductor@stat.math.ethz.ch
> Cc: amateos@cnio.es
> Subject: Re: [BioC] How to do k-fold validation using SVM
>
> Sorry for jumping into the thread so late: I just read the posting today.
> Anyway, I have used the following code a couple of times, and maybe it is of
> some help to you. In each "round" (for each set of training data) I select
> the best K genes (where "best" means largest absolute value of the t
> statistic) and then fit the svm using those K genes.
>
>
> For a couple of reasons, I use mt.maxT (from the multtest library) to get
> the t statistic, but you can modify the function at your convenience (e.g.,
> an ANOVA for 3 or more classes, or whatever you want). Note also that I use
> a linear kernel.
>
> Hope it helps,
>
> Ramón
>
> gene.select <- function(data, class, size = NULL, threshold = NULL) {
>     ## Return the indices of the 'size' genes (columns of 'data') with the
>     ## largest absolute t statistic.  mt.maxT sorts its output by the test
>     ## statistic, so the top rows are the "best" genes.  This is faster than:
>     ##   t.stat <- apply(data, 2, function(x) abs(t.test(x ~ class)$statistic))
>     tmp <- mt.maxT(t(data), class, B = 1)[, c(1, 2)]
>     selected <- tmp[seq_len(size), 1]
>     return(selected)
> }
>
>
> cross.valid.svm <- function(data, y, knumber = 10, size = 200) {
>     ## data is structured as subjects in rows, genes in columns
>     ## (and thus is transposed inside gene.select to be fed to mt.maxT).
>
>     ## If you want leave-one-out, set knumber = NULL.
>     ## size is the number of genes used when building the classifier;
>     ## those are the "best" genes, based on a t-statistic.
>     ## (The error counts below assume the two classes are coded 0 and 1.)
>
>     ## First, selecting the data subsets for cross-validation
>     if(is.null(knumber)) {
>         knumber <- length(y)
>         leave.one.out <- TRUE
>     }
>     else leave.one.out <- FALSE
>     N <- length(y)
>     if(knumber > N) stop("knumber has to be <= number of subjects")
>     reps <- floor(N/knumber)
>     reps.last <- N - (knumber - 1) * reps
>     index.select <- c(rep(seq(from = 1, to = (knumber - 1)), reps),
>                       rep(knumber, reps.last))
>     index.select <- sample(index.select)
>
>     cv.errors <- matrix(-99, nrow = knumber, ncol = 4)
>
>     ## Fit model for each data set.
>     for(sample.number in 1:knumber) {
>         ## gene selection on the training part only
>         gene.subset <- gene.select(data[index.select != sample.number, ],
>                                    y[index.select != sample.number],
>                                    size = size)
>
>         ## predict from svm on that subset
>         y.f <- factor(y)
>         test.set <- data[index.select == sample.number, gene.subset]
>         ## for leave-one-out the test set is a single row, so keep it a matrix
>         if(is.null(dim(test.set))) test.set <- matrix(test.set, nrow = 1)
>         predicted <- predict(svm(data[index.select != sample.number, gene.subset],
>                                  y.f[index.select != sample.number],
>                                  kernel = "linear"),
>                              test.set)
>
>         cv.errors[sample.number, 1] <- length(which((y.f[index.select == sample.number] == 1)
>                                                     & predicted == 0))
>         cv.errors[sample.number, 2] <- length(which((y.f[index.select == sample.number] == 0)
>                                                     & predicted == 1))
>         cv.errors[sample.number, 3] <- length(which(y.f[index.select == sample.number] != predicted))
>         cv.errors[sample.number, 4] <- length(predicted)
>     }
>     cv.errors <- data.frame(cv.errors)
>     names(cv.errors) <- c("true.1.pred.0", "true.0.pred.1", "total.error",
>                           "number.predicted")
>     average.error.rate <- sum(cv.errors[, 3]) / sum(cv.errors[, 4])
>     return(list(cv.errors = cv.errors, average.error.rate = average.error.rate))
> }
>
>
> ## Example call (matrix.covar: subjects in rows, genes in columns;
> ##  class.vector: the 0/1 class labels):
> cross.valid.svm(matrix.covar, class.vector, knumber = 10, size = 10)

-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
http://bioinfo.cnio.es/~rdiaz