[BioC] Normalized microarray data and meta-analysis

Mcmahon, Kevin kwyatt.mcmahon at ttuhsc.edu
Thu Dec 18 17:59:41 CET 2008


I'm very excited about this discussion, and I appreciate everyone's
input.  Thanks especially to Thomas, who noted the paper indicated that
fold change is the most reproducible between groups.  

Overall, it appears that which method should be used depends largely on
the goals of the project.  I agree with Thomas that direct comparison of
p-values is often problematic, precisely for the reason he mentioned -
that a higher N will yield lower p-values for the same effect.  However,
if your different studies do happen to have at least similar N, then
combining p-values might be a reasonable approach.
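For illustration, here is a minimal, pure-Python sketch of one standard
p-value combining method, Fisher's method (the per-study p-values below
are made up; in practice you would use the combining routine of your
meta-analysis package of choice):

```python
import math

def fisher_combine(pvalues):
    """Combine independent per-study p-values with Fisher's method.

    Under the null hypothesis, -2 * sum(ln(p_i)) follows a chi-square
    distribution with 2k degrees of freedom (k = number of studies).
    """
    k = len(pvalues)
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    # The chi-square survival function has a closed form for even
    # df = 2k:  sf(x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    combined_p = math.exp(-half) * total
    return stat, combined_p

# A gene that is modestly but consistently significant in three studies:
stat, p = fisher_combine([0.04, 0.03, 0.05])
```

Note that Fisher's method assumes the studies are independent and can be
driven by a single very small p-value, which ties back to the N issue
above.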

Additionally, the use of an effect size or other statistic often favors
reproducibility of that statistic rather than actual biological
significance.  You might call a gene significantly differentially
expressed if its statistic in three different studies is 0.25, 0.25, and
0.25, even though that is not a very large statistic.  On the other
hand, a gene with a statistic of 0.25, 1, and 3 would not be considered
significant, simply because of the variance of the statistic between
studies.  We chose this approach in spite of that drawback because our
question was specifically, "which genes are consistently differentially
expressed?"  Because we are looking for consistency, we have chosen to
accept slight but reproducible changes.
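The trade-off can be made concrete with a toy score (purely
illustrative, not our actual method): rank genes by the mean effect
divided by its between-study standard deviation, so that agreement
across studies matters more than magnitude:

```python
import statistics

def consistency_score(effects, eps=1e-6):
    """Toy score: mean effect divided by its between-study standard
    deviation.  High when studies agree, low when they disagree,
    regardless of how large the effect is.  (Illustrative only; eps
    guards against zero variance for perfectly reproducible genes.)
    """
    return statistics.mean(effects) / (statistics.stdev(effects) + eps)

consistent = consistency_score([0.25, 0.25, 0.25])  # small but identical
variable = consistency_score([0.25, 1.0, 3.0])      # larger but erratic
```

Under such a score the small-but-identical gene ranks far above the
large-but-erratic one, which is exactly the behavior described above.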

Finally, at least according to the paper Thomas noted, fold change
appears to be the most reproducible statistic between laboratories.
This makes sense: a small difference in fold change can have a large
effect on the p-value, so when you compare between groups the
differences in fold change are relatively small while the differences in
p-value can be large.  However, comparing fold change between
experiments would absolutely necessitate similar normalization schemes,
whereas a common statistic or p-value combining method relies instead on
each author's own analysis and interpretation.
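To see why fold change travels better than p-values, consider a
simplified two-sample z-test (a stand-in for the usual t-test, with
made-up numbers): two labs that agree closely on the fold change can
still report p-values that differ by well over an order of magnitude:

```python
import math

def two_sided_z_pvalue(mean_diff, sd, n_per_group):
    """Two-sided p-value from a two-sample z-test with known, equal
    standard deviations.  A simplification of the usual t-test, used
    here only to show how p-values react to small shifts in the
    measured group difference."""
    se = sd * math.sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / se
    return math.erfc(z / math.sqrt(2.0))  # equals 2 * normal sf(z)

# Two labs measure nearly the same log2 fold change (1.0 vs. 0.8);
# with sd = 0.5 and n = 10 per group, the resulting p-values still
# differ by well over an order of magnitude.
p_lab1 = two_sided_z_pvalue(1.0, 0.5, 10)
p_lab2 = two_sided_z_pvalue(0.8, 0.5, 10)
```

So a 20% disagreement in fold change becomes a many-fold disagreement in
the p-value, which is why p-values compare so poorly across labs.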

Thanks to everyone who helped with this discussion.  It is, of course,
still open.

Wyatt


K. Wyatt McMahon, Ph.D.
Texas Tech University Health Sciences Center
Department of Internal Medicine
3601 4th St. 
Lubbock, TX - 79430
806-743-4072
"It's been a good year in the lab when three things work. . . and one of
those is the lights." - Tom Maniatis


> -----Original Message-----
> From: Thomas Hampton [mailto:Thomas.H.Hampton at Dartmouth.edu]
> Sent: Wednesday, December 17, 2008 8:57 PM
> To: Paul Leo
> Cc: Mcmahon, Kevin; bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Normalized microarray data and meta-analysis
> 
> I feel that p-values, corrected or otherwise, may be unsatisfactory for
> detecting concordance between experiments. For example, an experiment
> with
> higher N will show lower p-values for the same gene, even under
> conditions that are otherwise precisely the same. So we can't compare
> p values head to head across multiple experiments directly. Simple
> simulations show
> that straight fold change can be more predictive of future behavior
> (say, in
> somebody else's study) than statistics which place a high premium on
> within-group consistency.
> 
> Check this out:
> 
> BMC Bioinformatics. 2008; 9(Suppl 9): S10.
> Published online 2008 August 12. doi: 10.1186/1471-2105-9-S9-S10.
> PMCID: PMC2537561
> Copyright (c) 2008 Shi et al; licensee BioMed Central Ltd.
> 
> The balance of reproducibility, sensitivity, and specificity of lists
> of differentially expressed genes in microarray studies
> 
> 
> Cheers
> 
> Tom
> 
> 
> On Dec 17, 2008, at 7:06 PM, Paul Leo wrote:
> 
> > No, you don't need the raw data.  However, you do need to check that
> > p-values were calculated the same way between experiments (they will
> > be consistent if you use GEO-processed data) - what if one group did
> > a multiple testing correction and the other did not?  Perhaps this is
> > already accounted for in the method you mentioned?
> >
> > You may wish to consider whether you will combine p-values at the
> > gene level or the probe level.  Most favour the probe level due to
> > splice variants, etc.
> >
> > If you are comparing across array platforms then you need to be very
> > careful; a conservative approach is to BLAST probe-to-probe across
> > array platforms to get the correspondence.  Illumina provides
> > "pre-blasted" probe sets on their FTP site for Illumina-Affy
> > comparisons.
> >
> > Best of luck.
> >
> > Cheers
> > Paul
> >
> >
> > -----Original Message-----
> > From: bioconductor-bounces at stat.math.ethz.ch
> > [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of
> > Mcmahon, Kevin
> > Sent: Thursday, 18 December 2008 8:31 AM
> > To: bioconductor at stat.math.ethz.ch
> > Subject: [BioC] Normalized microarray data and meta-analysis
> >
> > Hello Bioconductor-inos,
> >
> >
> >
> > I have more of a statistical/philosophical question regarding using
> > raw
> > vs. normalized data in a microarray meta-analysis.  I've looked
> > through
> > the bioconductor archives and have found some addressing of this
> > issue,
> > but not exactly what I'm concerned with.  I don't mean to waste
> > anyone's
> > time, but I was hoping I could get some help here.
> >
> >
> >
> > I've performed a meta-analysis using the downloaded data from 3
> > different GEO data sets (GDS).  It is my understanding that these are
> > normalized data from the various microarray experiments.  It seems to
> > me that if the data from those normalized results are normally
> > distributed, those three experiments are perfectly comparable (if you
> > think the authors' respective normalization approaches were
> > reasonable).  All you need to do is calculate some sort of effect
> > size/determine a p-value/etc. for all genes in the experimental
> > conditions of interest and then combine these statistics across the
> > different experiments.  However, I consistently read things like "raw
> > data are required for a microarray meta-analysis."  Does this mean
> > that normalized data are not directly comparable with each other?  If
> > so, then why does GEO even host such data?
> >
> >
> >
> > Any help would be wonderful!
> >
> >
> >
> > Wyatt
> >
> >
> >
> > K. Wyatt McMahon, Ph.D.
> >
> > Texas Tech University Health Sciences Center
> >
> > Department of Internal Medicine
> >
> > 3601 4th St.
> >
> > Lubbock, TX - 79430
> >
> > 806-743-4072
> >
> > "It's been a good year in the lab when three things work. . . and
> > one of
> > those is the lights." - Tom Maniatis
> >
> >
> >
> >
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> >


