[BioC] RMA-bimodality:

Tue Jun 6 16:53:59 CEST 2006

Hi all,

I almost always see the bi-modality, and I think Claus has the right 
argument, that the first peak represents not expressed/very weakly 
expressed genes. Most Affy chips have fairly good coverage of the genome, 
and hence for almost any sample a good proportion of the genes will not be 
expressed. I find it to (GC)RMA's credit that it identifies the 
non-expressed genes so clearly. Additionally, I routinely filter out genes 
if they are called "Absent" on all arrays by the MAS 5 algorithm, and this 
greatly reduces the height of the first peak. Occasionally I even filtered 
out genes unless they were "Present" on all arrays just to see what would 
happen to the distribution, and lo and behold, the bimodality 
disappears! So rather than seeing the bimodality as a problem, I view it 
as accurately representing the real expression distribution.
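
A minimal R sketch of the filtering described above, on simulated data 
rather than a real chip. The mixture proportions, means, and the rule that 
generates the mock "P"/"A" calls are all assumptions made for illustration. 
With real data the calls could instead come from exprs(mas5calls(abatch)) 
in the affy package, where abatch is a hypothetical AffyBatch object.

```r
# Simulated stand-in for log2 (GC)RMA values; all numbers are assumptions.
set.seed(1)
n_genes  <- 5000
n_arrays <- 4

# ~40% of genes unexpressed (narrow low peak), ~60% expressed (broad high peak)
is_expressed <- runif(n_genes) < 0.6
gene_mean <- ifelse(is_expressed, rnorm(n_genes, 8, 1.5), rnorm(n_genes, 3, 0.5))
expr <- matrix(rnorm(n_genes * n_arrays, mean = gene_mean, sd = 0.3),
               nrow = n_genes)

# Mock detection calls: the chance of "P" rises with expression level.
# Real MAS 5 calls also include "M" (Marginal), which is ignored here.
p_present <- plogis(2 * (expr - 5))
calls <- ifelse(runif(length(expr)) < p_present, "P", "A")

# The two filters from the text
not_all_absent <- rowSums(calls != "A") > 0          # drop genes "A" on all arrays
all_present    <- rowSums(calls == "P") == n_arrays  # keep genes "P" on all arrays

# Compare the densities before and after each filter
par(mfrow = c(1, 3))
plot(density(expr), main = "All genes")
plot(density(expr[not_all_absent, ]), main = "Not Absent on all arrays")
plot(density(expr[all_present, ]), main = "Present on all arrays")
```

Comparing the three panels shows how much of the low-intensity peak each 
filter removes in this toy setting. It says nothing by itself about any 
particular real data set.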

Cheers,
Jenny

At 08:17 AM 6/6/2006, Claus Mayer wrote:
>Hi Wolfgang (and everybody else)!
>
>As pointed out by you, there are two different issues here: a) the 
>bi-modality of (GC)RMA normalized data on many chips (which I have 
>observed repeatedly now as well), b) the bi-modality of log(PM/MM) values 
>as stated in the Irizarry et al. paper.
>
>In both cases the mathematical argument holds that any continuous 
>distribution can be monotonically transformed into any other continuous 
>distribution (which is basically behind your statement that monotonic 
>transformations do not preserve the number of peaks/modes), but I still 
>think that observation a), the bi-modal distribution of gcrma normalized 
>expression values, is worth discussing. Assuming GCRMA is a good/perfect 
>normalisation method, the normalised values should directly relate to the 
>"true" biological expressions, and thus it is tempting to take such a 
>histogram as an indication of there being two classes of genes: i) genes 
>with no/small expression values (forming the first peak), ii) 
>truly/highly expressed genes (forming the second peak). If on the other 
>hand the bi-modality is an implicit by-product of the GCRMA normalisation, 
>it doesn't make sense to interpret the bi-modality biologically in that 
>way.
>
>I have only limited experience with Affy arrays so far, but in at least 
>one case the bi-modality also occurred (but not so clearly) when using 
>MAS5 instead of GCRMA, which I took as an indication that in this case 
>GCRMA didn't create the two modes, but just made it easier to distinguish 
>between them. I would be interested to hear the experiences of others in 
>this respect.
>
>Best Wishes
>Claus
>
>Wolfgang Huber wrote:
>> Hi,
>>
>> I am surprised that anybody is surprised about the different number of
>> modes ("peaks"): the number of modes of a distribution is not conserved
>> under monotonic transformations (such as the background correction in
>> RMA); this simply follows from the chain rule.
>>
>> See below for a simple example with some "mock" microarray intensities z
>> and the density of log-transformed values before and after a (primitive)
>> background correction.
>>
>> Cheers
>>  Wolfgang
>>
>> set.seed(123)
>> n = 100000
>> # mock intensities: two log-normal populations plus a constant background of 20
>> z = 20 + exp(c(rnorm(n), 3+rnorm(n)))
>>
>> par(mfrow=c(1,2))
>> plot(density(log2(z)))      # before background correction
>> plot(density(log2(z-20)))   # after subtracting the background
>>
>> noel0925 at sbcglobal.net wrote:
>>> In the paper "Exploration, Normalization and Summaries of High Density
>>> Oligonucleotide Array Probe Level Data", the following statement
>>> regarding the bimodality of log2(PM) values and RMA background
>>> corrected PM values can be found: "The same bimodal effect is seen when
>>> we stratify by log2(PM), thus it is not an artifact of conditioning on
>>> sums." (p4). I am a little confused by this, as I thought that it was
>>> indeed an artifact of the convolution!
>>>
>>> Clearly, the background corrected intensity values are given by
>>> E(S | O), the conditional expectation of the signal given what we
>>> observe, where the observed signal is the convolution of a normally
>>> distributed background (N) with mean mu and variance sigma^2
>>> (B ~ N(mu, σ^2)) and an exponentially distributed signal (S) with
>>> mean alpha (S ~ exp(α)).
>>>
>>> There have been several postings regarding this matter in the
>>> Bioconductor archives and all seem to point to this. Have I
>>> misunderstood?
>>>
>>> In particular, there was the following post:
>>> https://stat.ethz.ch/pipermail/bioconductor/2004-August/005908.html
>>> (see below the response from zwu at jhsph.edu):
>>>
>>> The original question I got was about the bimodal distribution of gcrma
>>> result from probe intensities with unimodal distribution. My answer was
>>> that the "change" was not necessarily surprising.
>>>
>>> For example, when you have "true log signal" from a bimodal
>>> distribution
>>> logS=c(rnorm(1000,3,1),rnorm(1000,8,2))
>>> # You will see this has two peaks
>>> par(mfrow=c(2,2))
>>> plot(density(logS))
>>> #if the background, log(non-specific binding) come from
>>> logB=rnorm(2000,6,1)
>>> #then when you plot the histogram of convolution in log scale,
>>> plot(density(log(exp(logS)+exp(logB))))
>>> #you see only one peak, and this would be "before gcrma".
>>>
>>> This explanation made sense to me, but seems to contradict what is
>>> stated in the paper.
>>>
>>> Also, can someone explain the difference between RMA background
>>> version1 vs version2?
>>>
>>> Best regards,
>>> Noel
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>--
>***********************************************************************************
>Dr Claus-D. Mayer                    | http://www.bioss.ac.uk
>Biomathematics & Statistics Scotland | email: claus at bioss.ac.uk
>Rowett Research Institute            | Telephone: +44 (0) 1224 716652
>Aberdeen AB21 9SB, Scotland, UK.     | Fax: +44 (0) 1224 715349

Jenny Drnevich, Ph.D.

Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign

330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA

ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at uiuc.edu