[BioC] Surrogate variable analysis fails with “subscript out of bounds”

Mon Jul 2 20:22:00 CEST 2012

Hi Vebjorn,

This problem is likely because of the small number of genes/features
you are considering (453) and the high dimension of the response
variable (12). With so many different levels of the response variable,
many features are likely significantly associated with the response.
Part of the iteration in the sva algorithm is to downweight features
strongly associated with the response, so the whole data set is being
down-weighted to 0.

I would suggest running only one iteration of sva. Usually it takes a
very small number of iterations to converge, and since your data are
relatively low dimensional in the number of features, this may be the
best that you can do if you are doing artifact discovery.

Best,

Jeff

On Mon, Jul 2, 2012 at 2:12 PM, Vebjorn Ljosa <ljosa at broad.mit.edu> wrote:
> (This is a crossposting of a question I have asked on
> stackoverflow.com, so far with no responses:
> http://stackoverflow.com/questions/11252724/surrogate-variable-analysis-fails-with-subscript-out-of-bounds)
>
> I am trying to apply surrogate variable analysis using [Bioconductor's
> sva package][1]. The example in [the vignette][2] works fine, but when
> I try it with real data, I get a "subscript out of bounds" error in
> `irwsva.build`:
>
>     $ R
>
>     R version 2.15.0 (2012-03-30)
>     …
>     > trainData <-
> read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainData.txt')
>     > trainpheno <-
> read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainpheno.txt')
>     > testData <-
> read.table('http://www.broadinstitute.org/~ljosa/svaproblem/testData.txt')
>     > trainData <- as.matrix(trainData)
>     > testData <- as.matrix(testData)
>     > library(sva)
>     > trainMod <- model.matrix(~as.factor(label), trainpheno)
>     > num.sv(trainData, trainMod)
>     [1] 8
>     > trainMod0 <- model.matrix(~1, trainpheno)
>     > trainSv <- sva(trainData, trainMod, trainMod0)
>     Number of significant surrogate variables is:  8
>     Iteration (out of 5 ):1  2  3  4  5  Error in irwsva.build(dat =
> dat, mod = mod, mod0 = mod0, n.sv = n.sv,  :
>       subscript out of bounds
>
> An attempt to narrow it down with `debug()` revealed that `fast.svd`
> is being called on a 453 x 100 matrix of all zeros. (The dimensions
> 453 x 100 are the same as my training set.) This results in a `V` that
> is 100 x 0; the "subscript out of bounds" error is because
> `irwsva.build` attempts to index into `V`. There must be something
> about my data that causes this behaviour—but what?
>
> As a possible workaround, I tried calling `sva` with `method="two-step"`:
>
>     > trainSv <- sva(trainData, trainMod, trainMod0, method='two-step')
>     Number of significant surrogate variables is:  8
>
> That worked, but I need to subsequently call `fsva`. That failed
> because calling `sva` with `method="two-step"` resulted in
> `trainSv$pprob.b` being NULL.
>
> So how do my data differ from those in the vignette? The training and
> test data are matrices in both cases. In the vignette the training
> matrix is 22283 x 30; in my case, it is 453 x 100. In the vignette,
> the variable of interest (_cancer_) is binary; in my case, the
> dependent variable can take 12 different values.
>
> The last difference seems to be important, for if I reduce the range
> to [0, 7], it works:
>
>     > trainMod <- model.matrix(~as.factor(label), trainpheno %% 8)
>     > trainSv <- sva(trainData, trainMod, trainMod0)
>     Number of significant surrogate variables is:  9
>     Iteration (out of 5 ):1  2  3  4  5  >
>
> Thinking that perhaps 100 samples (columns) is just insufficient for
> 12 classes, I tried a similar dataset with 293 columns. (The data are
> from the same experiment, but profile the 293 individual samples
> rather than the 100 treatments.) It did not help:
>
>     > trainData <-
> read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainData3.txt')
>     > trainpheno <-
> read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainpheno.txt')
>     > trainData <- as.matrix(trainData)
>     > trainMod <- model.matrix(~as.factor(label), trainpheno)
>     > trainMod0 <- model.matrix(~1, trainpheno)
>     > trainSv <- sva(trainData, trainMod, trainMod0)
>     Number of significant surrogate variables is:  11
>     Iteration (out of 5 ):1  2  3  4  5  Error in irwsva.build(dat =
> dat, mod = mod, mod0 = mod0, n.sv = n.sv,  :
>       subscript out of bounds
>
> If I limit sva to one iteration, it is able to run to completion, but
> I don't know if I can trust the results:
>
>     > trainSv <- sva(trainData, trainMod, trainMod0, B=1)
>     Number of significant surrogate variables is:  11
>     Iteration (out of 1 ):1  >
>
> Does anyone understand `irwsva` well enough to say why this is
> happening? Is there anything I can do to make it work on my data?
>
> Thanks,
> Vebjorn
>
>   [1]: http://www.bioconductor.org/packages/release/bioc/html/sva.html
>   [2]: http://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor