[BioC] Siggenes/SAM vs Excel

Tue Jul 8 18:47:25 CEST 2008

>On my data, SAM certainly does a great job compared to a t test.
>However, like Guo of MAQC fame, I get better concordance between
>replicates with a fold change and loose p value cutoff. This seems
>far too simple to be better than SAM, but there you have it. Bioc's
>RankProd also gives me good concordance. It does, however, order
>its gene lists by straight fold change...

I cringed a little when I read the MAQC's overall conclusion that 
(given replicates) FC or FC + loose p-values gives the best 
concordance between gene lists from replicate experiments, different 
pre-processing methods and/or different platforms. It seems like it 
gives permission to almost completely ignore statistical treatment of 
the data and go back to just using FC to identify differentially 
expressed genes!  However, I don't think they adequately pointed out 
that "concordance between gene lists" doesn't really mean those genes 
are "all the truly DE genes" but rather "only those large FCers that 
can be detected no matter what measuring method or data 
pre-processing algorithm is used."

It's also worth pointing out that MAQC used TECHNICAL replicates 
only, and compared very disparate mRNA populations - total human RNA 
vs. human brain RNA, iirc. In real microarray experiments, one uses 
(hopefully!) biological replicates and compares mRNA populations that 
are much more similar. Of course, both of these just make it more 
difficult to detect differences, and the MAQC's recommendation to use 
FC to find "repeatable" lists would be even more applicable in this 
situation! However, most researchers that I work with have way too 
simplistic a view of what "gene expression" actually is and how it is 
measured; they assume there is one "real" level of expression that 
indicates the "real" protein level, and that all methods should be 
able to measure this "real" level, else they are bad methods.

Case in point: I had a client who was testing a knockout's (KO's) 
effect on expression patterns. In the Affymetrix microarray data, the 
KO gene was actually UP-regulated in the KO samples! They of course 
got all upset over this because they had confirmed the KO with qPCR. 
It turns out they had only deleted two exons, and their PCR primers 
targeted that deleted region of the transcript. The Affy probes, 
however, were for a different exon that was still being transcribed. 
The lack of 
functional protein likely caused the samples to up-regulate mRNA 
production. The moral of the story: it is possible for different 
methods of measuring mRNA levels to give different answers, YET STILL 
BOTH BE CORRECT because they are usually measuring different sections 
of the transcript with different methods. I tell this story every 
time a researcher tells me they can't "confirm" the microarray 
results via qPCR :)

So it's fine to use FC or FC + loose p-value if you only want to find 
genes with very large expression differences and can accept missing 
other genes that may also have important expression differences. Genes 
whose expression patterns differ between measuring methods may turn 
out to be very interesting genes after all!
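
To make the "FC + loose p-value" recipe and the idea of list 
concordance concrete, here is a minimal Python sketch. The function 
names, the cutoffs (|log2 FC| >= 1, p <= 0.05), the toy numbers and 
the overlap measure are all assumptions for illustration; this is not 
MAQC's actual procedure.

```python
def select_by_fc_and_p(log2_fc, p_values, min_abs_log2_fc=1.0, max_p=0.05):
    """Keep genes passing both cutoffs, ordered by decreasing |log2 FC|.

    Cutoff defaults are illustrative assumptions, not recommendations.
    """
    kept = [
        gene
        for gene, fc in log2_fc.items()
        if abs(fc) >= min_abs_log2_fc and p_values[gene] <= max_p
    ]
    return sorted(kept, key=lambda gene: abs(log2_fc[gene]), reverse=True)


def concordance(list_a, list_b):
    """Shared genes divided by the size of the shorter list.

    One simple overlap measure (an assumption); other definitions exist.
    """
    set_a, set_b = set(list_a), set(list_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


# Toy data for two replicate experiments (made-up numbers).
exp1_fc = {"g1": 3.0, "g2": -2.5, "g3": 0.4, "g4": 1.2}
exp1_p = {"g1": 0.001, "g2": 0.02, "g3": 0.0005, "g4": 0.2}
exp2_fc = {"g1": 2.8, "g2": -2.2, "g3": 0.5, "g4": 1.1}
exp2_p = {"g1": 0.003, "g2": 0.04, "g3": 0.001, "g4": 0.03}

list1 = select_by_fc_and_p(exp1_fc, exp1_p)
list2 = select_by_fc_and_p(exp2_fc, exp2_p)
overlap = concordance(list1, list2)
print(list1, list2, overlap)
```

In this toy example the two lists agree on every gene in the shorter 
list, yet g3, which has a small but highly significant change in both 
experiments, never makes either list.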

Jenny

>Perhaps SAM lists ordered by fold change and cut off at a certain
>FDR would create optimal concordance. We do not really expect FDR to be
>more reproducible than fold change, do we?
>
>Anyway, thanks for all your input into this list.
>
>yours,
>
>Tom
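
Tom's suggestion (cut the list at an FDR, then order it by fold 
change) could be sketched roughly as follows in Python. This assumes 
Storey-style q-values with pi0 estimated at a single fixed lambda of 
0.5; the function names, the 0.05 cutoff and the toy values are 
hypothetical, ties get no special treatment, and this is not the code 
of siggenes, qvalue or Excel SAM.

```python
def storey_pi0(p_values, lam=0.5):
    """Fixed-lambda estimate of the proportion of true nulls, capped at 1."""
    m = len(p_values)
    pi0 = sum(p > lam for p in p_values) / (m * (1.0 - lam))
    return min(pi0, 1.0)


def q_values(p_values, lam=0.5):
    """q(p_(i)) = min over j >= i of pi0 * m * p_(j) / j (no tie handling)."""
    m = len(p_values)
    pi0 = storey_pi0(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q


# Toy input (made-up numbers): one p-value and one log2 FC per gene.
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
p = [0.001, 0.004, 0.01, 0.03, 0.2, 0.4, 0.6, 0.9]
log2_fc = [0.8, 2.5, -3.1, 1.4, 0.3, -0.2, 0.1, 0.05]

q = q_values(p)
# Keep genes with q <= 0.05 (assumed cutoff), then order by |log2 FC|.
kept = [i for i in range(len(genes)) if q[i] <= 0.05]
ranked = [genes[i] for i in sorted(kept, key=lambda i: abs(log2_fc[i]), reverse=True)]
print(storey_pi0(p), ranked)
```

The FDR cutoff decides which genes enter the list; the fold change 
only decides their order within it.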
>
>On Jul 3, 2008, at 11:04 PM, Holger Schwender wrote:
>
>>Hi Tom,
>>
>>not sure how often I have answered this question here and in other
>>forums. But okay once again: The defaults of siggenes and Excel SAM
>>are a bit different:
>>
>>- In siggenes, a moderated Welch's t-statistic is computed by
>>default, whereas a moderated version of the ordinary t-statistic
>>assuming equal group variances is used in Excel SAM. Set
>>var.equal=TRUE in sam to use the ordinary t-statistic (only
>>necessary if the sizes of the groups differ).
>>
>>- In siggenes, the mean number of falsely called genes (warning:
>>this is *not* the expected number of false positives) is computed,
>>whereas the median number is used by Excel SAM. Set med=TRUE in sam
>>to use the median number.
>>
>>- In siggenes, a natural cubic spline based approach is used to
>>estimate pi0, whereas Excel SAM uses an ad hoc estimate. Set
>>lambda=0.5 in sam to use this ad hoc estimate.
>>
>>- Even though I have implemented the computation of the fudge
>>factor s0 exactly as described in the Excel SAM manual, the value
>>of s0 usually differs between siggenes and Excel. Not sure why.
>>
>>- In siggenes s0=0 is also a choice for the fudge factor. In the
>>old version of Excel SAM it is not. Not sure about the new version.
>>
>>- I have not found a description of how the q-values are estimated
>>in Excel SAM. The q-values usually differ between siggenes and
>>Excel. In siggenes, the computation of the q-values is implemented
>>in virtually the same way as in John Storey's R package qvalue, so
>>the q-values are typically the same. They only differ when there
>>are tied p-values, since siggenes handles ties a bit differently
>>(in my opinion more correctly) than John's function qvalue.
>>
>>- The same seed for the random number generator will not lead to
>>the same permutations of the response.
>>
>>Best,
>>Holger
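
The first bullet above (Welch vs. equal-variance statistic, plus the 
fudge factor s0) can be illustrated with a small Python sketch. It 
assumes the SAM-style form d = (mean difference) / (s + s0); the 
function name, argument names and data are hypothetical, and this is 
not the siggenes implementation.

```python
import math
from statistics import mean, variance


def d_statistic(x, y, s0=0.0, var_equal=False):
    """Moderated t-like statistic: (mean(x) - mean(y)) / (s + s0).

    var_equal=False uses the Welch standard error; var_equal=True uses
    the pooled (equal group variance) standard error.
    """
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)  # sample variances (n - 1 denominator)
    if var_equal:
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        s = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    else:
        s = math.sqrt(v1 / n1 + v2 / n2)
    return (mean(x) - mean(y)) / (s + s0)


# Balanced groups (made-up values): both standard errors coincide.
bal_x, bal_y = [5.1, 5.4, 4.9, 5.6], [4.0, 4.2, 3.9, 4.3]
d_welch_bal = d_statistic(bal_x, bal_y)
d_pooled_bal = d_statistic(bal_x, bal_y, var_equal=True)

# Unbalanced groups with unequal variances: the two versions differ.
unb_x, unb_y = [5.1, 5.4, 4.9, 5.6, 7.0], [4.0, 4.2, 3.9]
d_welch_unb = d_statistic(unb_x, unb_y)
d_pooled_unb = d_statistic(unb_x, unb_y, var_equal=True)

# A positive fudge factor shrinks the statistic toward zero.
d_fudged = d_statistic(unb_x, unb_y, s0=0.5)
print(d_welch_bal, d_pooled_bal, d_welch_unb, d_pooled_unb, d_fudged)
```

With equal group sizes the two standard errors are algebraically 
identical, which is consistent with the remark that var.equal only 
matters when the group sizes differ.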
>>
>>
>>
>>-------- Original Message --------
>>>Date: Thu, 3 Jul 2008 09:21:41 -0400
>>>From: Thomas Hampton <Thomas.H.Hampton at Dartmouth.EDU>
>>>To: "Holger Schwender" <holger.schw at gmx.de>
>>>CC: bioc <bioconductor at stat.math.ethz.ch>
>>>Subject: [BioC] Siggenes/SAM vs Excel
>>
>>>How similar is siggenes to the Stanford SAM with the Excel front
>>>end?
>>>
>>>Thanks!
>>>
>>>Tom
>>
>>--
>>
>>_______________________________________________
>>Bioconductor mailing list
>>Bioconductor at stat.math.ethz.ch
>>https://stat.ethz.ch/mailman/listinfo/bioconductor
>>Search the archives:
>>http://news.gmane.org/gmane.science.biology.informatics.conductor

Jenny Drnevich, Ph.D.

Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign

330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA

ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at illinois.edu