[BioC] quantile normalization approach

Sat Mar 22 16:04:21 MET 2003

Dear Rafael,

Thanks for your reply.  

> > Quantile-quantile normalization assumes common distribution for data sets to
> > be normalized. I am fine with replicate normalization using this. However,
> > for different experiments, such as data from different tissues, is the
> > assumption still valid?
> 
> probably not. but when replicate arrays have completely
> different distributions, in my opinion one is left with no choice but
> to make such assumptions. are you willing to make the assumption they all
> have the same median? how about the same quartiles? where to draw the line
> is not easy.

I agree with you. Here is my thinking: for replicates (for example, three
chips from one sample preparation), it is probably a valid assumption, even
if one replicate is twice as bright as another (of course, problematic
chips, for example ones with scratches, have to be removed first). Sample
variation has less effect here.

However, for samples from different tissues, it is hard to believe this is
true. It is quite possible that the samples follow the same type of
distribution but with different means and variances (I have looked at many
QQ plots from different experiments). Of course, genes with obvious,
biologically relevant expression changes are usually a minority in a large
data set, so it is probably still OK to make that assumption. I am just
trying to find out whether there are rigorous comparisons to back it up,
rather than only an assumption.
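
To make the concern concrete, here is a minimal sketch of quantile
normalization in plain Python. It is only an illustration under stated
assumptions, not the implementation used by any Bioconductor package: the
function name `quantile_normalize`, the simulated "tissue" arrays, and their
means and standard deviations are all hypothetical. It shows that two arrays
drawn from the same family but with different mean and variance end up with
identical distributions after normalization.

```python
import random
import statistics


def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays (illustrative sketch only)."""
    n = len(arrays[0])
    # Sort each array, then average across arrays at each rank to build
    # the common reference distribution.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        # Indices of this array ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace value by reference quantile
        normalized.append(out)
    return normalized


random.seed(0)
# Hypothetical log-intensities: same family (normal), different mean/variance.
tissue_a = [random.gauss(7.0, 1.0) for _ in range(1000)]
tissue_b = [random.gauss(8.0, 2.0) for _ in range(1000)]
norm_a, norm_b = quantile_normalize([tissue_a, tissue_b])

for name, before, after in (("A", tissue_a, norm_a), ("B", tissue_b, norm_b)):
    print(name,
          "mean/sd before:", round(statistics.mean(before), 2),
          round(statistics.stdev(before), 2),
          "after:", round(statistics.mean(after), 2),
          round(statistics.stdev(after), 2))
```

Whatever difference in location and spread existed between the two arrays
is removed, whether it was technical or biological. That is exactly the
assumption in question.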

> > Could anybody point me to some reference that conducts comparison under many
> > different experimental conditions? (for example, under >10 different tissues
> > or cell line experiments).  I read all the papers/ documents I can find. But
> > still not convinced we can use that assumption.
> 
> both RMA papers (Biostatistics and NAR) apply the method to the dilution
> data set that has liver and central nervous system cell lines.

I have read these papers. They are very good and well written. The dilution
data, with the same background, can help in understanding replicate
normalization and give some understanding of sample variation. However, to
understand issues across different samples (with totally different
backgrounds), two cell lines may not be enough (of course, these two cell
lines seem to have been carefully chosen). I am thinking that something like
known amounts of spike-ins, added before/after sample preparation in many
different tissue/cell line backgrounds, would give a better understanding.
It would address variation caused by chips and sample preps as well as by
differences in sample background complexity. The last one is probably more
biologically relevant.
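
As a rough illustration of what such a spike-in check could look like, here
is a small simulation in plain Python. Everything in it is an assumption
made for the sketch: the data are simulated rather than real, and the
background distributions, the spike-in levels, and the helper
`quantile_normalize` are hypothetical. The same known spike-in amounts are
added to two different backgrounds, so their true between-array difference
is zero, and the difference is then measured again after quantile
normalization.

```python
import random


def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays (illustrative sketch only)."""
    n = len(arrays[0])
    ranked = [sorted(a) for a in arrays]
    # Mean of the sorted values at each rank is the common reference.
    reference = [sum(r[i] for r in ranked) / len(ranked) for i in range(n)]
    result = []
    for a in arrays:
        out = [0.0] * n
        for rank, idx in enumerate(sorted(range(n), key=lambda i: a[i])):
            out[idx] = reference[rank]
        result.append(out)
    return result


random.seed(1)
n_genes = 2000
spike_levels = [4.0, 6.0, 8.0, 10.0, 12.0]  # hypothetical known log2 amounts

# Two simulated backgrounds of different complexity (assumed, not real data).
background_a = [random.gauss(7.0, 1.0) for _ in range(n_genes)]
background_b = [random.gauss(8.0, 2.0) for _ in range(n_genes)]

# Identical spike-ins are appended to both arrays: true difference is 0.
array_a = background_a + spike_levels
array_b = background_b + spike_levels

norm_a, norm_b = quantile_normalize([array_a, array_b])

# Between-array difference for each spike-in, before and after normalization.
diffs_before = [array_b[n_genes + k] - array_a[n_genes + k]
                for k in range(len(spike_levels))]
diffs_after = [norm_b[n_genes + k] - norm_a[n_genes + k]
               for k in range(len(spike_levels))]

for level, before, after in zip(spike_levels, diffs_before, diffs_after):
    print(f"spike {level:5.1f}: diff before = {before:+.2f}, after = {after:+.2f}")
```

In this toy setting, any nonzero difference after normalization is bias
introduced by forcing the two backgrounds onto a common distribution. Real
spike-in data across many tissues would show whether the effect matters in
practice.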

Does this make sense at all?

What in your opinion is the best normalization method so far? 

Regards

-h
