[BioC] Analyzing arrays with no replicates

McGee, Monnie mmcgee at mail.smu.edu
Sun Feb 20 00:35:54 CET 2005


Dear BioC,

I am trying to analyze a microarray data set with 4 samples and 54K genes.  None of the samples are replicated.  I would like to know which genes are differentially expressed among the 4 samples.

Here are the details of my data:
> print(affy.object)
AffyBatch object
size of arrays=1164x1164 features (42345 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=4
number of genes=54675
annotation=hgu133plus2

I have several questions:

1.  I used rma(affy.object) to obtain normalized, background corrected expression levels.  However, rma uses median polish to obtain expression values.  I read in the achieves of this list that median polish was not a good idea when there are no replicates, and that rlm would be better.  In order to use rlm, one needs an expression set, which means I have to specify some sort of summary method (using expresso) or take median polish (rma).  If not median polish, which summary method?  I have also read on the list that reviewers of some journals consider that median polish is the best method, and, if it is not used, the analysis is suspect (resulting in a reanalysis or a rejection).  So, if I don't use median polish (or rma) how do I justify the method I use?

2. When I do my.eset <- rma(affy.object), I obtain an object of class "exprSet", but I think the phenoData attribute is incorrect;
> print(my.eset)
Expression Set (exprSet) with 
        54675 genes
        4 samples
                 phenoData object with 1 variables and 4 cases
         varLabels
                sample: arbitrary numbering
> pData(my.eset)
          sample
Control    1
Treat1      2
Treat2      3
Treat3      4

It seems to me that the phenoData object should have 4 variables and 1 case.  In other words, phenoData sees my treatments as cases (or patients) and says that I have one variable.


3. Once I have expression levels (and a correct eset), what is the best way to do non-specific filtering?  Clearly, I don't want to test all 54K genes.  I  tried the following, suggested in one of the R working papers:

>library(genefilter)
> f1 <- pOverA(0.25, log2(100))
> f2 <- function(x) (IQR(x) > 0.5)
> ff <- filterfun(f1,f2)
> selected <- genefilter(my.eset,ff)
>sum(selected)
[1] 57

Maybe 57 genes is a great number, but it seems much smaller than the examples I've seen in papers.  Also, one of the AFFX genes is in this filtered sample.  Since they are housekeeping genes, I would think that they aren't supposed to pass through the filter.

4. Once I have my filtered genes, how can I test for differential expression among the 4 groups?  I've played around with mt.maxT (with the test="f" option), but I haven't been able to get the proper class labels.

Thanks for your help and suggestions!  Obviously, I am new to this.  I find the work fascinating, but I realize every day that I have so much more to understand.

Thanks,
Monnie

Monnie McGee, Ph.D.
Assistant Professor
Department of Statistical Science
Southern Methodist University
Ph: 214-768-2462
Fax: 214-768-4035



More information about the Bioconductor mailing list