[BioC] Siggenes/SAM vs Excel
drnevich at illinois.edu
Tue Jul 8 18:47:25 CEST 2008
>On my data, SAM certainly does a great job compared to a t test.
>However, like Guo of MAQC fame, I get better
>concordance between replicates with a fold change and loose p value
>cutoff. This seems far too simple
>to be better than SAM, but there you have it. Bioc's RankProd also
>gives me good concordance. It does, however, order
>its gene lists by straight fold change...
I cringed a little when I read the MAQC's overall conclusion that
(given replicates) FC or FC + loose p-values gives the best
concordance between gene lists from replicate experiments, different
pre-processing methods and/or different platforms. It seems like it
gives permission to almost completely ignore statistical treatment of
the data and go back to just using FC to identify differentially
expressed genes! However, I don't think they adequately pointed out
that "concordance between gene lists" doesn't really mean those genes
are "all the truly DE genes" but rather "only those large FCers that
can be detected no matter what measuring method or data
pre-processing algorithm is used."
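
To make the "concordance" point concrete, here is a minimal sketch of how list overlap between two replicate experiments might be scored. The function names, the overlap definition (shared genes divided by the shorter list length) and the log2 fold changes are all made up for illustration; they are not the MAQC procedure or data.

```python
def top_k_by_abs(scores, k):
    """Return the set of the k gene IDs with the largest absolute score."""
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return set(ranked[:k])


def concordance(list_a, list_b):
    """Shared genes as a fraction of the shorter list (an assumed definition)."""
    a, b = set(list_a), set(list_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical log2 fold changes for the same five genes in two replicate runs.
rep1 = {"g1": 3.1, "g2": -2.8, "g3": 0.4, "g4": 0.3, "g5": -0.2}
rep2 = {"g1": 2.9, "g2": -3.0, "g3": -0.1, "g4": 0.5, "g5": 0.35}

# Short lists hold only the large FCers; longer lists reach into the noisier genes.
short_list = concordance(top_k_by_abs(rep1, 2), top_k_by_abs(rep2, 2))
long_list = concordance(top_k_by_abs(rep1, 4), top_k_by_abs(rep2, 4))
```

In this toy data the two-gene lists agree completely while the four-gene lists do not, which is the sense in which high concordance says more about the large FCers than about "all the truly DE genes."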
It's also worth pointing out that MAQC used TECHNICAL replicates
only, and compared very disparate mRNA populations - universal human
reference RNA vs. human brain reference RNA, iirc. In real microarray experiments, one uses
(hopefully!) biological replicates and compares mRNA populations that
are much more similar. Of course, both of these just make it more
difficult to detect differences, and the MAQC's recommendation to use
FC to find "repeatable" lists would be even more applicable in this
situation! However, most researchers that I work with have way too
simplistic a view of what "gene expression" actually is and how it is
measured; they assume there is one "real" level of expression that
indicates the "real" protein level, and that all methods should be
able to measure this "real" level, else they are bad methods.
Case in point: I had a client that was testing a KO's effect on
expression patterns. In the Affymetrix microarray data, the KO gene
was actually UP-regulated in the KO samples! They of course got all
upset over this because they had confirmed the KO with qPCR. It turns
out they had only deleted two exons and their PCR primers measured
this region of the transcript. However, the Affy probes were for a
different exon that was still being transcribed. The lack of
functional protein likely caused the samples to up-regulate mRNA
production. The moral of the story: it is possible for different
methods of measuring mRNA levels to give different answers, YET STILL
BOTH BE CORRECT because they are usually measuring different sections
of the transcript with different methods. I tell this story every
time a researcher tells me they can't "confirm" the microarray
results via qPCR :)
So it's fine to use FC or FC + loose p-value if you only want to find
genes with very large expression differences, and not other genes
which also may have important expression differences. Genes with
different expression patterns according to different measuring
methods may be very interesting genes after all!
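
As a rough sketch of what an "FC + loose p-value" rule does, the filter below keeps a gene only if it passes both cutoffs. The cutoffs (2-fold, p <= 0.05), the function name and the per-gene numbers are assumptions picked for illustration; nothing here comes from MAQC or from real data.

```python
def select_fc_loose_p(genes, min_abs_log2fc=1.0, max_p=0.05):
    """Keep genes whose |log2 FC| and p-value both pass the (assumed) cutoffs.

    genes maps a gene ID to a (log2 fold change, p-value) pair.
    """
    return [
        gene
        for gene, (log2fc, p) in genes.items()
        if abs(log2fc) >= min_abs_log2fc and p <= max_p
    ]


# Hypothetical results: geneB has a small but very consistent change.
results = {
    "geneA": (2.5, 0.030),
    "geneB": (0.4, 0.0001),
    "geneC": (1.8, 0.200),
}

selected = select_fc_loose_p(results)
```

Here geneB is dropped despite having the smallest p-value, because its fold change is below the cutoff. That is exactly the kind of gene this rule will never report.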
> Perhaps SAM lists ordered by fold change and cut off at a certain
>FDR would create optimal concordance. We do not really expect FDR to be
>more reproducible than fold change, do we?
>Anyway, thanks for all your input into this list.
>On Jul 3, 2008, at 11:04 PM, Holger Schwender wrote:
>>not sure how often I have answered this question here and in other
>>forums. But okay once again: The defaults of siggenes and Excel SAM
>>are a bit different:
>>- In siggenes, a moderated Welch's t-statistic is computed by
>>default, whereas a moderated version of the ordinary t-statistic
>>assuming equal group variances is used in Excel SAM. Set
>>var.equal=TRUE in sam to use the ordinary t-statistic (only
>>necessary if the sizes of the groups differ).
>>- In siggenes, the mean number of falsely called genes (warning:
>>this is *not* the expected number of false positives) is computed,
>>whereas the median number is used by Excel SAM. Set med=TRUE in sam
>>to use the median number.
>>- In siggenes, a natural cubic spline based approach is used to
>>estimate pi0, whereas Excel SAM uses an ad hoc estimate. Set
>>lambda=0.5 in sam to use this ad hoc estimate.
>>- Even though I have implemented the computation of the fudge
>>factor s0 exactly as described in the Excel SAM manual, the value
>>of s0 usually differs between siggenes and Excel. Not sure why.
>>- In siggenes s0=0 is also a choice for the fudge factor. In the
>>old version of Excel SAM it is not. Not sure about the new version.
>>- Have not found a description on how the q-values are estimated in
>>Excel SAM. The values of the q-values usually differ between
>>siggenes and Excel. In siggenes, the computation of the q-values is
>>implemented in virtually the same way as in John Storey's R package
>>qvalue such that q-values are typically the same. They only differ
>>when there are tied p-values, since siggenes handles ties a bit
>>differently (in my opinion more correctly) than John's function qvalue.
>>- The same seed for the random number generator will not lead to
>>the same permutations of the response.
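
The first three differences in the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the siggenes or Excel SAM code: the function names are made up, the statistic is the textbook SAM form (difference in means divided by standard error plus s0), and the fixed-lambda pi0 estimate is assumed to take the usual Storey form.

```python
import math
from statistics import mean, median, variance


def d_statistic(x, y, s0=0.0, var_equal=False):
    """Moderated t-type statistic for one gene: (mean(y) - mean(x)) / (s + s0)."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)  # sample variances (n - 1 denominator)
    if var_equal:
        # Ordinary t-statistic: pooled variance, equal group variances assumed.
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        s = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    else:
        # Welch form: each group keeps its own variance.
        s = math.sqrt(v1 / n1 + v2 / n2)
    return (mean(y) - mean(x)) / (s + s0)


def falsely_called(counts_per_permutation, use_median=False):
    """Summarise the per-permutation counts of called genes by mean or median."""
    if use_median:
        return median(counts_per_permutation)
    return mean(counts_per_permutation)


def pi0_fixed_lambda(pvalues, lam=0.5):
    """Share of p-values above lam, rescaled by 1 - lam and capped at 1."""
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / ((1.0 - lam) * len(pvalues)))
```

With equal group sizes the pooled and Welch standard errors are algebraically identical, which is why var.equal only matters when the group sizes differ. A positive s0 always shrinks the statistic towards zero, and it shrinks genes with small standard errors the most.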
>>-------- Original Message --------
>>>Date: Thu, 3 Jul 2008 09:21:41 -0400
>>>From: Thomas Hampton <Thomas.H.Hampton at Dartmouth.EDU>
>>>To: "Holger Schwender" <holger.schw at gmx.de>
>>>CC: bioc <bioconductor at stat.math.ethz.ch>
>>>Subject: [BioC] Siggenes/SAM vs Excel
>>>How similar is siggenes to the Stanford SAM with the Excel front
>>Bioconductor mailing list
>>Bioconductor at stat.math.ethz.ch
>>Search the archives: http://news.gmane.org/
Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign
1201 W. Gregory Dr.
Urbana, IL 61801
e-mail: drnevich at illinois.edu