[BioC] Breaking the "most genes not differentially expressed" assumption
paolo.innocenti at ebc.uu.se
Wed Apr 29 10:12:33 CEST 2009
Hi Wolfgang and list,
thanks for the suggestion:
Wolfgang Huber wrote:
> 1.) The correlation plot
> http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_correlationplot.pdf looks
> bizarre. Can you explain what it shows, and why you think it is
> consistent with a successful experiment?
If you refer to the "chessboard" effect, it should simply show higher
correlation within males and within females: the samples are indexed as
4 males, 4 females, 4 males, 4 females, etc. Moreover, the first set
of 8 samples is genetically more similar than the second set of 8,
etc. (I can't see that effect in the plot though).
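To illustrate why a block-wise sample order produces a chessboard, here is a minimal sketch on simulated data, not the arrays from this experiment. The gene count, effect sizes and noise level are arbitrary assumptions; only the 4 males / 4 females ordering is taken from the description above.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
n_genes = 2000                        # assumed toy size
sexes = (["M"] * 4 + ["F"] * 4) * 2   # 16 arrays ordered M,M,M,M,F,F,F,F,...

# Assumed model: shared per-gene baseline, a per-gene sex effect, array noise.
baseline = [rng.gauss(8, 2) for _ in range(n_genes)]
sex_effect = [rng.gauss(0, 1) for _ in range(n_genes)]

arrays = [
    [b + (e if sex == "F" else 0.0) + rng.gauss(0, 0.3)
     for b, e in zip(baseline, sex_effect)]
    for sex in sexes
]

# Array-by-array correlation matrix, as in a correlation plot.
corr = [[pearson(a, b) for b in arrays] for a in arrays]

n = len(arrays)
same = [corr[i][j] for i in range(n) for j in range(n)
        if i != j and sexes[i] == sexes[j]]
diff = [corr[i][j] for i in range(n) for j in range(n)
        if sexes[i] != sexes[j]]
mean_same = sum(same) / len(same)
mean_diff = sum(diff) / len(diff)
print("mean same-sex correlation:", round(mean_same, 3))
print("mean between-sex correlation:", round(mean_diff, 3))
```

With arrays in this order, the high same-sex correlations fall in alternating 4x4 blocks of the matrix, which is what gives the chessboard appearance.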
In a recent paper,
Ayroles, J. F., Carbone, M. A., Stone, E. A., Jordan, K. W., Lyman, R.
F., Magwire, M. M., Rollmann, S. M., Duncan, L. H., Lawrence, F.,
Anholt, R. R. H., & Mackay, T. F. C. 2009. Systems genetics of complex
traits in Drosophila melanogaster. Nat Genet 41: 299-307.
they found 88% of the genes differentially expressed in males vs. females
(consistent with my results), and the correlation plot on their data
(ArrayExpress E-MEXP-1594), after the same preprocessing, looks like this:
(here the samples are 2 males, 2 females, 2 males, ...)
> 2.) How does the array index relate to whether the sample is
> male/female? Could it be that further experimental factors (time, lab,
> reagent batch) are confounded with sex?
As I said, 4 males, 4 females, etc.
I don't think there's any other possible "lab" effect: flies were
flash-frozen within a 1-hour interval, RNA extractions were performed in
batches balanced for sex (half-half) and replicates, and randomized
within sex, within a 3-day interval, and arrays were run in batches
balanced for sex and replicates within 3-4 weeks. Same fresh reagents,
same protocols, same guy (me).
BUT, flies are sexually dimorphic not only for gene expression but for
SIZE as well. This means that extracting from whole flies gave a
consistent difference in total yield, females giving on average 3 times
as much as males. I'm no expert here, but might it be that the difference
in initial quantity (even though RNA has been diluted/concentrated to reach
the same concentration before hybridization) has an effect on the
results (a difference in degradation rate that depends on quantity when
stored for 2 weeks at -80 °C, just wild-guessing...)?
So with pre-processing I might be wiping out just technical errors
instead of biological variation (would that be corroborated by the fact
that I still find a HUGE effect?)
> 3.) I am puzzled by your sessionInfo(). How could you run "rma" without
> having a cdf package loaded?
My mistake: I think I made an error when saving and reloading environments
(don't know what I did exactly). Running the same code in another fresh
session without mistakes gives the sessionInfo() attached below (with
> 4.) You could try using different normalisation methods. The quantile
> normalisation used within rma is rather aggressive. You could try
> methods based on affine linear or local polynomial regression.
I'll give it a try with a few other normalisation methods. But the
question remains: with an "aggressive method" I find a huge effect. If I
apply a softer one, what am I expected to find?
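As a rough illustration of what "aggressive" means here, the following toy sketch applies a plain quantile normalisation to simulated data in which most genes are truly higher in one sex. It is not the code used inside rma; the function names, the 60% fraction, the effect size and the noise level are all assumptions chosen for the example.

```python
import random

def quantile_normalise(arrays):
    """Force every array to share the same empirical distribution.

    Ties are ignored, which is acceptable for continuous simulated values.
    """
    n_genes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays)
                 for k in range(n_genes)]
    out = []
    for a in arrays:
        order = sorted(range(n_genes), key=lambda g: a[g])
        normalised = [0.0] * n_genes
        for rank, g in enumerate(order):
            normalised[g] = reference[rank]
        out.append(normalised)
    return out

rng = random.Random(0)
n_genes = 1000
n_up = 600  # assumed: 60% of genes are truly higher in females
baseline = [rng.gauss(8, 2) for _ in range(n_genes)]

def simulate(is_female):
    """One array: baseline, +1 for the truly changed genes in females, noise."""
    return [b + (1.0 if is_female and g < n_up else 0.0) + rng.gauss(0, 0.2)
            for g, b in enumerate(baseline)]

raw = [simulate(False), simulate(False), simulate(True), simulate(True)]

def mean_f_minus_m(arrays, genes):
    """Mean female-minus-male difference over the given genes."""
    m, f = arrays[:2], arrays[2:]
    diffs = [(f[0][g] + f[1][g]) / 2 - (m[0][g] + m[1][g]) / 2 for g in genes]
    return sum(diffs) / len(diffs)

unchanged = range(n_up, n_genes)  # genes with no true sex effect
normalised = quantile_normalise(raw)
bias_before = mean_f_minus_m(raw, unchanged)
bias_after = mean_f_minus_m(normalised, unchanged)
print("unchanged genes, F - M before normalisation:", round(bias_before, 3))
print("unchanged genes, F - M after normalisation: ", round(bias_after, 3))
```

In this simulation the truly unchanged genes come out as apparently lower in females after normalisation, because the method removes the overall shift between the two distributions. A gentler method would be expected to distort less, but whether that matters for a real data set depends on how asymmetric the true changes are.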
At the end of the day: do my data look normal, and: am I breaking
Thanks a lot for your help,
and thanks in advance for any additional help,
> Best wishes
R version 2.8.0 (2008-10-20)
attached base packages:
 tools stats graphics grDevices utils datasets methods
other attached packages:
 limma_2.16.3 drosophila2cdf_2.3.0 affy_1.20.0
loaded via a namespace (and not attached):
 affyio_1.10.1 preprocessCore_1.4.0
> Wolfgang Huber, EMBL, http://www.ebi.ac.uk/huber
> Paolo Innocenti wrote:
>> Hi all,
>> I have a dataset of 120 Affy arrays, 60 males and 60 females.
>> The expression profiles of the 2 groups differ dramatically, i.e. if
>> I run a standard RMA + limma, I have ~90% of the genes differentially
>> expressed. Also, downregulated genes are twice as many as
>> upregulated genes, although if I impose a cutoff of two-fold
>> difference in expression, they are almost equal (15% up and 15% down).
>> This is clearly breaking the assumption that most of the genes on the
>> array should not be differentially expressed, but the result is in
>> line with the current knowledge of sex-biased gene expression in my
>> model organism.
>> I have done some quality control plots, available here:
>> - Boxplot:
>> - Frequency histogram:
>> - RLE and NUSE plots:
>> - CorrelationPlot:
>> - PCA, after RMA normalization:
>> Now, my questions are:
>> 1) Is my issue really an issue? If so, how can I perform a robust
>> normalization of my arrays?
>> 2) Is there a tool to assess how "robust" your pre-processing method
>> is with respect to this issue?
>> 3) Sex-biased gene expression is not the only biological question in
>> my experiment. Is the massive size of this effect going to affect the
>> "detectability" of other smaller effects? (through normalization or
>> correction for multiple testing or other?)
Department of Animal Ecology, EBC
75236 Uppsala, Sweden