[BioC] Opinions on array design, normalization, and linear modeling with LIMMA

Fri Nov 2 12:27:17 CET 2007

Quoting Kasper Daniel Hansen <khansen at stat.berkeley.edu>:

> I agree here, the scale on the y-axis is quite dramatic. Note that we
> are not necessarily saying that too many genes are DE, but that some
> of them have dramatic fold changes.

It really depends on the biology of the experiment, and as during
embryogenesis you have quite dramatic changes, I don't think the range
of the M values is something to worry about... at least not without
checking the biology first. The original poster seemed to expect a lot
of variation between the time points compared.
I have seen similar MA plots when comparing, for instance, two cell
lines that are supposedly derived from the same tissue... (a totally
different problem, I know...)

> Most of the normalization techniques are derived under the assumption
> that not too many genes are DE. Facing your problem of many DE genes,
> some people would say "clearly the assumptions are not correct". I
> would say that you should use the methods that gives you the best
> inference. Sometimes people have observed that applying the
> "standard" normalization techniques actually improve their calls,
> even on datasets with many DE genes.

I don't think that's entirely correct. I don't think that the
assumption is that not too many genes are DE, but that *most*
genes are not DE, or that the DE genes are evenly spread between up-
and downregulation across the range of raw intensities measured. It's
a fine distinction.
Imagine an MA plot (raw data) where everything lies around the M=0
line, very tightly, with just a few genes straying up to higher |M|
values. Then imagine another MA plot where you have the same
situation, plus another few thousand spots, evenly distributed up or
down, with as extreme values as you like...
Normalisation methods like loess simply try to determine what is "not
changed": fit a regression curve and it will neatly follow along the
M=0 line... It will do so in both cases described above. So the
question is not simply whether few genes are DE... if the % of
DE genes is low, of course that makes things easier, as their
contribution to the regression curve fitted to all of the spots will be
small. But you can have many DE genes and, as long as they are balanced
between up and down, still be able to use loess perfectly happily.
You really have to look at the data, and have an idea of the biology
of the experiment, to know what you are expecting (i.e. whether the
bulk of the data is really not DE).
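
To make the argument above a bit more concrete, here is a minimal sketch in Python (it is not limma code). It simulates the two MA plots described above, plus a third one in which all the DE genes go up, and fits a simplified loess-like curve: a tricube-weighted local linear regression, without the robustness iterations that real loess implementations usually offer. The spot counts, noise level, span and all function names are assumptions made up for the illustration.

```python
import random


def local_linear_fit(x, y, x0, frac=0.4):
    """Tricube-weighted local linear fit evaluated at x0 (simplified loess)."""
    k = max(2, int(frac * len(x)))
    # Bandwidth: distance from x0 to its k-th nearest neighbour.
    h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
    sw = swx = swy = swxx = swxy = 0.0
    for xi, yi in zip(x, y):
        d = abs(xi - x0) / h
        if d >= 1.0:
            continue
        w = (1.0 - d ** 3) ** 3
        dx = xi - x0
        sw += w
        swx += w * dx
        swy += w * yi
        swxx += w * dx * dx
        swxy += w * dx * yi
    denom = sw * swxx - swx * swx
    if denom == 0:
        return swy / sw
    # Intercept of the weighted regression of y on (x - x0),
    # i.e. the fitted value at x0.
    return (swxx * swy - swx * swxy) / denom


def simulate(n_null, n_de, one_sided=False, seed=0):
    """Simulated (A, M) values: tight non-DE cloud plus n_de DE spots."""
    rng = random.Random(seed)
    a_vals, m_vals = [], []
    for _ in range(n_null):
        a_vals.append(rng.uniform(6, 14))
        m_vals.append(rng.gauss(0, 0.2))      # non-DE: tight around M=0
    for i in range(n_de):
        # Balanced case: alternate up/down; one-sided case: all up.
        sign = 1 if (one_sided or i % 2 == 0) else -1
        a_vals.append(rng.uniform(6, 14))
        m_vals.append(sign * rng.uniform(1, 4))
    return a_vals, m_vals


def max_abs_fit(a_vals, m_vals):
    """Largest |fitted M| over a few interior A values."""
    grid = [8, 9, 10, 11, 12]
    return max(abs(local_linear_fit(a_vals, m_vals, a0)) for a0 in grid)


cases = {
    "few DE genes": simulate(2000, 20),
    "many DE genes, balanced up/down": simulate(2000, 2000),
    "many DE genes, all up": simulate(2000, 2000, one_sided=True),
}
for label, (a_vals, m_vals) in cases.items():
    print(f"{label}: max |fit| = {max_abs_fit(a_vals, m_vals):.2f}")
```

Under these assumptions the fitted curve should stay close to M=0 in the first two cases, and be pulled well away from it only when the DE genes are both numerous and one-sided.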

This is why it's so hard to recommend any way to normalise data just
by looking at a plot... I'd say that in most experiments, a loess
regression curve is good enough as a normalisation aid, and that's why
people often use it with good results even when not all the
assumptions are perfectly met, especially that of not having many DE
genes.

The only sure way to normalise any set of data is to have a good set
of control spots whose behaviour is known a priori. But one can often
do without it and get reasonable results. Most of us do :)
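
As a rough illustration of why control spots help, here is a small Python sketch (again, not limma code). It assumes a made-up intensity-dependent dye bias, a set of control spots whose true log-ratio is known to be 0, and a large one-sided block of upregulated genes. A binned median stands in for a loess curve to keep the code short. Every name and number here is hypothetical.

```python
import random
from statistics import median

rng = random.Random(1)


def dye_bias(a):
    # Hypothetical intensity-dependent bias, invented for this sketch.
    return 0.1 * (a - 10.0)


def make_spots(n, kind, true_m):
    """n spots as (A, observed M, kind); true_m() gives the true log-ratio."""
    spots = []
    for _ in range(n):
        a = rng.uniform(6, 14)
        m = true_m() + dye_bias(a) + rng.gauss(0, 0.2)
        spots.append((a, m, kind))
    return spots


spots = (
    make_spots(500, "control", lambda: 0.0)        # known a priori: M = 0
    + make_spots(1000, "unchanged", lambda: 0.0)   # unknown to the analyst
    + make_spots(3000, "up", lambda: rng.uniform(1, 4))
)


def bin_of(a):
    # Unit-width intensity bins from A=6 to A=14.
    return min(int(a), 13)


def binned_median_curve(subset):
    """Median M per intensity bin (a crude stand-in for a loess curve)."""
    by_bin = {}
    for a, m, _ in subset:
        by_bin.setdefault(bin_of(a), []).append(m)
    return {b: median(ms) for b, ms in by_bin.items()}


def unchanged_median_after(curve):
    """Median normalised M of the truly unchanged genes."""
    return median(
        [m - curve[bin_of(a)] for a, m, kind in spots if kind == "unchanged"]
    )


controls = [s for s in spots if s[2] == "control"]
median_unchanged_control = unchanged_median_after(binned_median_curve(controls))
median_unchanged_all = unchanged_median_after(binned_median_curve(spots))

print(f"curve from control spots: {median_unchanged_control:.2f}")
print(f"curve from all spots:     {median_unchanged_all:.2f}")
```

In this deliberately extreme setup, the curve fitted to all spots should drag the unchanged genes well below zero, while the curve fitted to the controls should leave them near zero. With balanced or sparse DE genes the two curves would be much closer, which is why one can often do without controls.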

> I think most of us need more time with the data in order to really
> give you any recommendations. You should seek out a local expert.

Good suggestion, and don't forget to explain the biology behind the
experiment (i.e. the behaviour you expect, if known).

Jose

-- 
Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
