[BioC] Breaking the "most genes not differentially expressed" assumption

Paolo Innocenti paolo.innocenti at ebc.uu.se
Wed Apr 29 10:12:33 CEST 2009

Hi Wolfgang and list,

thanks for the suggestion:

Wolfgang Huber wrote:

> 1.) The correlation plot 
> http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_correlationplot.pdf looks 
> bizarre. Can you explain what it shows, and why you think it is 
> consistent with a successful experiment?

If you refer to the "chessboard" effect, it should simply show higher 
correlation within males and within females: the samples are indexed as 
4 males, 4 females, 4 males, 4 females, etc. Moreover, the samples in 
the first set of 8 are genetically more similar to each other than to 
those in the second set of 8, and so on (I can't see that effect in the 
plot though).
In a recent paper,

Ayroles, J. F., Carbone, M. A., Stone, E. A., Jordan, K. W., Lyman, R. 
F., Magwire, M. M., Rollmann, S. M., Duncan, L. H., Lawrence, F., 
Anholt, R. R. H., & Mackay, T. F. C. 2009. Systems genetics of complex 
traits in Drosophila melanogaster. Nat Genet 41: 299-307.

they found 88% of the genes differentially expressed in M vs. F. 
(consistent with my results), and the correlation plot on their data 
(ArrayExpress E-MEXP-1594), after the same preprocessing, looks like this:


(here the samples are 2 males, 2 females, 2 males, ...)
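
The chessboard pattern described above (higher correlation within each 
sex, with samples ordered as alternating blocks of males and females) 
can be reproduced on simulated data. A minimal sketch in Python 
(standard library only); the gene count, effect sizes and noise level 
are arbitrary assumptions chosen for illustration, not values estimated 
from these arrays:

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

random.seed(1)
n_genes = 500                          # assumed, for illustration only
sexes = (["M"] * 4 + ["F"] * 4) * 2    # 4 males, 4 females, 4 males, 4 females

# Per-gene baseline log2 expression and per-gene sex bias (both simulated).
baseline = [random.gauss(8, 2) for _ in range(n_genes)]
sex_bias = [random.gauss(0, 1) for _ in range(n_genes)]

samples = []
for sex in sexes:
    sign = 0.5 if sex == "F" else -0.5
    samples.append([b + sign * e + random.gauss(0, 0.3)
                    for b, e in zip(baseline, sex_bias)])

# Sample-by-sample correlation matrix, in array-index order.
corr = [[pearson(a, b) for b in samples] for a in samples]

pairs = [(i, j) for i in range(len(sexes)) for j in range(i + 1, len(sexes))]
within = [corr[i][j] for i, j in pairs if sexes[i] == sexes[j]]
between = [corr[i][j] for i, j in pairs if sexes[i] != sexes[j]]
mean_within = sum(within) / len(within)
mean_between = sum(between) / len(between)

print("mean correlation within sex: ", round(mean_within, 3))
print("mean correlation between sex:", round(mean_between, 3))
```

Drawing `corr` as a heatmap with the samples in array-index order should 
give alternating 4x4 blocks, which is all the chessboard amounts to in 
this toy setup.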

> 2.) How does the array index relate to whether the sample is 
> male/female? Could it be that further experimental factors (time, lab, 
> reagent batch) are confounded with sex?

As I said, 4 males, 4 females, etc.
I don't think there's any other possible "lab" effect: flies were 
flash-frozen within a 1-hour interval, RNA extractions were performed in 
batches balanced for sex (half-half) and replicates, and randomized 
within sex, over a 3-day interval; arrays were run in batches balanced 
for sex and replicates over 3-4 weeks. Same fresh reagents, same 
protocols, same guy (me).

BUT flies are sexually dimorphic not only in gene expression but in 
SIZE as well. This means that extracting from whole flies gave a 
consistent difference in total yield, females giving on average 3 times 
as much as males. I'm no expert here, but might it be that the 
difference in initial quantity (even though the RNA was 
diluted/concentrated to reach the same concentration before 
hybridization) has an effect on the results (for instance a degradation 
rate that depends on quantity when stored for 2 weeks at -80 °C, just 
wild-guessing...)?

So with pre-processing I might be wiping out just technical errors 
instead of biological variation (would that be corroborated by the fact 
that I still find a HUGE effect?)

> 3.) I am puzzled by your sessionInfo(). How could you run "rma" without 
> having a cdf package loaded?

My mistake: I think I made an error when saving and reloading 
environments (I don't know exactly what I did). Running the same code in 
a fresh session gives the sessionInfo() attached below (with the cdf).

> 4.) You could try using different normalisation methods. The quantile 
> normalisation used within rma is rather aggressive. You could try 
> methods based on affine linear or local polynomial regression.

I'll give it a try with a few other normalisation methods. But the 
question remains: with an "aggressive method" I find a huge effect. If I 
apply a softer one, what am I expected to find?
At the end of the day: do my data look normal, and: am I breaking 
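
Why quantile normalization counts as aggressive in this situation can 
be illustrated on simulated data: it forces every array to have exactly 
the same intensity distribution, so if a majority of genes really are 
shifted in one direction in one group, part of that shift is pushed 
onto the genes that did not change. A minimal sketch in Python 
(standard library only); `quantile_normalize` is a simplified stand-in 
written for this example, not the preprocessCore implementation, and 
the 60% fraction, effect size and noise level are arbitrary assumptions:

```python
import random

def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays.

    `arrays` is a list of equal-length lists (one list per array).
    Ties are broken by position, which is a simplification.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays)
                 for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]   # k-th smallest value -> k-th reference value
        normalized.append(out)
    return normalized

def mean_f_minus_m(arrays, sexes, genes):
    """Mean (female - male) log2 difference over the given gene indices."""
    f = [a for a, s in zip(arrays, sexes) if s == "F"]
    m = [a for a, s in zip(arrays, sexes) if s == "M"]
    diffs = [sum(a[g] for a in f) / len(f) - sum(a[g] for a in m) / len(m)
             for g in genes]
    return sum(diffs) / len(diffs)

random.seed(2)
n_genes, n_changed = 2000, 1200        # assumed: 60% of genes up in females
sexes = ["M"] * 4 + ["F"] * 4
baseline = [random.gauss(8, 2) for _ in range(n_genes)]

raw = []
for sex in sexes:
    raw.append([b + (1.0 if sex == "F" and g < n_changed else 0.0)
                + random.gauss(0, 0.1)
                for g, b in enumerate(baseline)])

norm = quantile_normalize(raw)

unchanged = range(n_changed, n_genes)  # genes with no true sex difference
before = mean_f_minus_m(raw, sexes, unchanged)
after = mean_f_minus_m(norm, sexes, unchanged)
print("unchanged genes, F - M before normalization:", round(before, 3))
print("unchanged genes, F - M after normalization: ", round(after, 3))
```

In this toy setup the unchanged genes show essentially no difference 
before normalization and an apparent down-regulation in females 
afterwards. Whether anything similar happens on real arrays depends on 
how asymmetric the true changes are, which a simulation cannot tell.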

Thanks a lot for your help,
and thanks in advance for any additional help,

> Best wishes
>      Wolfgang

 > sessionInfo()
R version 2.8.0 (2008-10-20)


attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] limma_2.16.3         drosophila2cdf_2.3.0 affy_1.20.0
[4] Biobase_2.2.1

loaded via a namespace (and not attached):
[1] affyio_1.10.1        preprocessCore_1.4.0

> ------------------------------------------------
> Wolfgang Huber, EMBL, http://www.ebi.ac.uk/huber
> Paolo Innocenti wrote:
>> Hi all,
>> I have a dataset of 120 Affy arrays, 60 males and 60 females.
>> The expression profiles of the 2 groups differ dramatically, i.e. if 
>> I run a standard RMA + limma, I have ~90% of the genes differentially 
>> expressed. Also, downregulated genes are twice as many as 
>> upregulated genes, although if I impose a cutoff of two-fold 
>> difference in expression, they are almost equal (15% up and 15% down).
>> This is clearly breaking the assumption that most of the genes on the 
>> array should not be differentially expressed, but the result is in 
>> line with the current knowledge of sex-biased gene expression in my 
>> model organism.
>> I have done some quality control plots, available here:
>> - Boxplot:
>> http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_boxplot.pdf
>> - Frequency histogram:
>> http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_histogram.pdf
>> - RLE and NUSE plots:
>> http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_RLEandNUSE1.pdf
>> - CorrelationPlot:
>> http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_correlationplot.pdf
>> - PCA, after RMA normalization:
>> http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_pca.pdf
>> Now, my questions are:
>> 1) Is my issue really an issue? If so, how can I perform a robust 
>> normalization of my arrays?
>> 2) Is there a tool to assess how "robust" your pre-processing method 
>> is with respect to this issue?
>> 3) Sex-biased gene expression is not the only biological question in 
>> my experiment. Is the massive size of this effect going to affect the 
>> "detectability" of other smaller effects? (through normalization or 
>> correction for multiple testing or other?)
>> Thanks,
>> paolo

Paolo Innocenti
Department of Animal Ecology, EBC
Uppsala University
Norbyvägen 18D
75236 Uppsala, Sweden
