[BioC] VSN: minimum number of controls?

Fri Apr 2 23:41:44 CEST 2010

Hello,

In my first project with R and Bioconductor, I am analyzing some small
microarrays, starting with variance-stabilizing normalization using vsn.
Using Wolfgang Huber's VSN.pdf tutorial I was able to do the exercise
with the "kidney" dataset without trouble.  However, when trying to run:

>  fit = vsn2( noDNAcontrols )
Error in .local(x, reference, strata, ...) :
  One or more of the strata contain less than 42 elements.
Please reduce the number of strata so that there is enough in each stratum.

using my own data, I got the error above.  I finally got around the
error by simulating a dataset containing 50 controls (my original data
had only 6).  Surprisingly, even 42 controls were insufficient.

A collaborator, using the same dataset, was able to run vsn successfully
using an earlier version of R (2.9.0) and Bioconductor (version ?).

Is anyone familiar with this problem?

I see two ways forward:

1.  Find the appropriate (old) version of Bioconductor and analyze with
the original controls.

2.  Use the current R/Bioconductor releases and either find a software
patch or a work-around.

As for #2, maybe it is not unreasonable to use >42 controls on most
microarrays.  However, this particular dataset is from a series of small
protein arrays (each probed with patient serum then visualized with
labeled anti-IgG) that contain only 214 antigens and 6 "no DNA" (meaning
"no protein") controls per patient (with a total of 853 patients in the
dataset).  Consequently, it is not possible to run a huge number of
controls, given the number of experimental cells per slide.

On a related note, in my effort to inflate the controls that I did have
into a sufficiently large number, I used "rnorm" to simulate/synthesize
the data.  Here "noDNAstats" is a 2 x 853 matrix consisting of the mean
and standard deviation from the patients' noDNAcontrols in the first and
second rows, respectively.

# Simulate 50 control values per patient (one column per patient), drawn
# from a normal distribution with that patient's control mean and SD.
noDNAsim50 = sapply(seq_len(ncol(noDNAstats)), function(i) {
    rnorm(50, mean = noDNAstats[1,i], sd = noDNAstats[2,i])
})

My understanding was that rnorm would create a dataset of the requested
size with the requested mean and SD.  The numbers I get are in the same
ballpark but the means and SD are not the same.  Am I missing something?
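
To make the question concrete, here is a minimal sketch of what I am
seeing.  The values 100 and 15 and the seed are made up for illustration
and are not taken from noDNAstats.  As far as I can tell, rnorm draws
random samples from a normal distribution with the given parameters, so
the sample mean and SD would only approximate them; the second half
assumes that rescaling the draw is an acceptable way to force an exact
match.

```r
# Illustration only: 100 and 15 are made-up values, not from noDNAstats.
set.seed(1)                      # arbitrary seed, for reproducibility
target_mean <- 100
target_sd   <- 15

# A random draw: its sample mean and SD are near, but not equal to,
# the requested parameters.
x <- rnorm(50, mean = target_mean, sd = target_sd)
mean(x)
sd(x)

# Standardize the draw (scale() centers to mean 0 and divides by the
# sample SD), then rescale so the sample moments match the targets.
x_exact <- as.numeric(scale(x)) * target_sd + target_mean
mean(x_exact)   # equals target_mean up to floating-point error
sd(x_exact)     # equals target_sd up to floating-point error
```

If that reading is right, the mismatch is expected sampling variation
rather than a mistake in my loop, and it should shrink as the number of
simulated values grows.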

Thanks!
eesnyder
-- 
Eric E. Snyder, Ph.D.
Virginia Bioinformatics Institute
Virginia Polytechnic Institute and State University
Blacksburg, VA 24061-0447
USA

Email:   eesnyder at vbi.vt.edu
Phone:   (540) 231-5428
JDAM:    N 37 13.248', W 80 25.551'