[BioC] How to do k-fold validation using SVM

Mon, 27 Jan 2003 13:55:25 -0000

That looks good--thanks
A question and a suggestion follow.

1. You say you used a linear kernel--Did you find this to be best after
testing and/or optimising the other kernels?

2. A set of wrapper functions (for multtest, ipred, and e1071) that
consistently interface the affy data object with a few good feature
selection methods and classification methods might be a useful new addition
to BioC.

Stephen Henderson

-----Original Message-----
From: Ramon Diaz [mailto:rdiaz@cnio.es] 
Sent: Monday, January 27, 2003 1:38 PM
To: bioconductor@stat.math.ethz.ch
Cc: amateos@cnio.es
Subject: Re: [BioC] How to do k-fold validation using SVM

Sorry for jumping into the thread so late: I just read the posting today.
Anyway, I have used the following code a couple of times, and maybe it is of

some help to you. In each "round" (for each set of training data) I select the best K genes (where best means with largest abs. value of t statistic) and then fit the svm using those K genes.

For a couple of reasons, I use mt.maxT (from the multtest library) to get
the  
t statistic, but you can modify the function at your convenience (like an ANOVA for > 3 or whatever you want). Note also that I use a linear kernel. 

Hope it helps,

Ramón

gene.select <- function(data, class, size = NULL, threshold = NULL) {
#    t.stat <- apply(data, 2, function(x) {abs(t.test(x ~
class)$statistic)}) 
this is slower than
    tmp <- mt.maxT(t(data), class, B= 1)[, c(1, 2)]
        selected <- tmp[seq(1:size), 1]
    return(selected)
}

cross.valid.svm <- function(data, y, knumber = 10, size = 200) {
    ## data is structured as subjects in rows, genes in columns
    ## (and thus is transposed inside gene.select to be fed to mt.maxT).

    ## If you want leave-one-out, set knumber = NULL
    ## size is the number of genes used when building the classifier.
    ##  those are the "best" genes, based on a t-statistic.

## First, selecting the data subsets for cross-validation
  if(is.null(knumber)) {
        knumber <- length(y)
        leave.one.out <- TRUE
    }
    else leave.one.out <- FALSE
    N <- length(y)
    if(knumber > N) stop(message = "knumber has to be <= number of
subjects")
    reps <- floor(N/knumber)
    reps.last <- N - (knumber-1)*reps
    index.select <- c( rep(seq(from = 1, to = (knumber - 1)), reps),
                      rep(knumber, reps.last))
    index.select <- sample(index.select)

    cv.errors <- matrix(-99, nrow = knumber, ncol = 4)

    ## Fit model for each data set.    
    for(sample.number in 1:knumber) {
        ## gene selection
        gene.subset <- gene.select(data[index.select != sample.number, ],
                                    y[index.select != sample.number],
                                    size = size)

        ## predict from svm on that subset
        y.f <- factor(y)
        test.set <- data[index.select == sample.number, gene.subset]
        if(is.null(dim(test.set))) test.set <- matrix(test.set, nrow = 1) ##

for leave-one-out
        predicted <- predict(svm(data[index.select != sample.number, 
gene.subset],
                                 y.f[index.select != sample.number],
                                 kernel = "linear"), test.set)

        cv.errors[sample.number, 1] <- length(which((y.f[index.select == 
sample.number] == 1)
                                                    & predicted == 0))
        cv.errors[sample.number, 2] <- length(which((y.f[index.select == 
sample.number] == 0)
                                                    & predicted == 1))
        cv.errors[sample.number, 3] <- length(which(y.f[index.select == sample.number] != predicted))
        cv.errors[sample.number, 4] <- length(predicted)

    }
    cv.errors <- data.frame(cv.errors)
    names(cv.errors) <- c("true.1.pred.0", "true.0.pred.1", "total.error", "number.predicted")
    average.error.rate <- sum(cv.errors[, 3])/sum(cv.errors[, 4])
    return(list(cv.errors = cv.errors, average.error.rate = 
average.error.rate))

}

## An example code:
cross.valid.svm(matrix.covar, class.vector, k = 10, size = 10)

-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
http://bioinfo.cnio.es/~rdiaz

**********************************************************************
This email and any files transmitted with it are confidential an ... [[dropped]]