[BioC] quantile normalization approach

Laurent Gautier laurent at cbs.dtu.dk
Sun Mar 23 03:20:16 MET 2003

Dear Hui,

I have a few comments too (inserted in your previous posts).

On Sat, Mar 22, 2003 at 04:04:21PM -0800, Wang, Hui wrote:
> Dear Rafael,
> Thanks for your reply.  
> > > Quantile-quantile normalization assumes common distribution for data sets to
> > > be normalized. I am fine with replicate normalization using this. However,
> > > for different experiments, such as data from different tissues, is the
> > > assumption still valid?
> > 
> > probably not. but when replicate arrays have completely
> > different distributions, in my opinion one is left with no choice but
> > to make such assumptions. are you willing to make the assumption they all
> > have the same median? how about the same quartiles? where to draw the line
> > is not easy.
> I agree with you. Here is my thought: for replicates (for example, three
> chips from one sample preparation), it is probably a valid assumption (of
> course you have to get rid of problematic chips first, for example chips that
> have scratches), even if one replicate is 2 times brighter than the other. The
> sample variation has less effect here.
> However, for samples from different tissues, it is hard to believe this is
> true. It is very possible that the samples belong to the same type of
> distribution, but with different means and variances (I have looked at many
> QQ plots from different experiments). Of course, genes with obvious expression
> changes (biologically relevant) are usually a minority in a huge data set. It
> is probably still OK to make that assumption. I am just trying to find out
> whether there are rigorous comparisons (vs. an assumption).

I am completely on your side about the underlying assumptions for what
I would call 'distribution driven transformation methods'. When using
them, one clearly assumes that on the biological side of the story only
very few genes are differentially expressed across the different experiments.
If one has any reason to suspect that it is not the case(*), those normalization
methods are to be used with care. The method 'invariantset' could make you
feel more confident for such cases. However, it does not necessarily mean
that these normalization methods are not acceptable for such cases. I did
run one of them(**) on data from different tissues, and I was pleasantly
surprised when looking at a matrix of scatter plots for the probe level
intensities: the difference between tissues could be observed visually.
But, naturally, a more in-depth study of these normalization methods for
these cases would be needed. Doing a spike-in of thousands of genes
is obviously not the thing to do, but I remember seeing a draft of a paper
on a web site that used a very clever idea: using the mRNA from two different
tissues, a third condition was created by mixing RNA from the two tissues.
The first name on the draft was William J Lemon (whose email cannot be found
in my messy ${HOME} at the moment); he may have other suggestions too...
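
To make the mixing idea a bit more concrete, here is a rough Python sketch of
how such a design could serve as a check on a normalization method. It is not
taken from the draft: the 50:50 mixing fraction, the assumption that raw-scale
signal is linear in the amount of RNA, the toy intensities, and the function
names (expected_mixture, mean_abs_deviation) are all mine, for illustration only.

```python
def expected_mixture(tissue_a, tissue_b, fraction_a):
    """Expected raw-scale intensity per gene for a mix of two tissues.

    Assumption (hypothetical, for illustration): signal is linear in the
    amount of RNA, so the mix is a weighted average of the pure tissues.
    """
    if not 0.0 <= fraction_a <= 1.0:
        raise ValueError("fraction_a must be between 0 and 1")
    return [fraction_a * a + (1.0 - fraction_a) * b
            for a, b in zip(tissue_a, tissue_b)]


def mean_abs_deviation(observed, expected):
    """Average absolute gap between observed and expected intensities."""
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(observed)


# Toy normalized intensities for three genes (made-up numbers).
tissue_a = [100.0, 400.0, 50.0]
tissue_b = [100.0, 200.0, 150.0]
observed_mix = [100.0, 310.0, 95.0]

expected = expected_mixture(tissue_a, tissue_b, 0.5)
score = mean_abs_deviation(observed_mix, expected)
print(expected, score)
```

A normalization method that distorts real tissue differences should, under
these assumptions, push the observed mixture away from the expected one, so a
smaller score would be better.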

(*): like comparing cells from different tissues as you mentioned, or maybe
studies of dividing/resting cells, or diauxic shift, or reaction to heat
shock, or healthy/infected cells...
(**): I can't remember which one it was now... quantiles, qspline, or something else?
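
For readers who want to see what the common-distribution assumption does in
practice, here is a minimal Python sketch of the quantile idea. It is only an
illustration, not the Bioconductor implementation: the function name
quantile_normalize and the toy chips are made up, and ties are broken by
position, which is a simplification.

```python
from statistics import mean


def quantile_normalize(arrays):
    """Give every chip the same empirical distribution of intensities.

    arrays: list of equal-length lists, one list of intensities per chip.
    Simplification (assumption): ties are broken by probe position.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all chips must have the same number of probes")

    # Reference distribution: mean of the k-th smallest value across chips.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [mean(column) for column in zip(*sorted_arrays)]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


# Two toy chips; chip_b is much brighter overall than chip_a.
chip_a = [2.0, 4.0, 6.0, 8.0]
chip_b = [30.0, 10.0, 40.0, 20.0]
norm_a, norm_b = quantile_normalize([chip_a, chip_b])
print(norm_a)
print(norm_b)
```

After the transformation both chips contain exactly the same set of values and
only the ranks of the probes within each chip are kept. That is why the method
is safest when few genes really differ between the samples.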

Hopin' it helps,


> > > Could anybody point me to some reference that conducts comparison under many
> > > different experimental conditions? (for example, under >10 different tissues
> > > or cell line experiments).  I read all the papers/ documents I can find. But
> > > still not convinced we can use that assumption.
> > 
> > both RMA papers (Biostatistics and NAR) apply the method to the dilution
> > data set that has liver and central nervous system cell lines.
> I read these papers. They are very good papers and well written. For the
> dilution data with the same background, they can help in understanding
> replicate normalization and give some understanding of sample variation.
> However, to understand issues across different samples (totally different
> backgrounds), two cell lines may not be enough (of course, these two cell
> lines seem carefully chosen). I am thinking that something like known amounts
> of spike-ins before/after sample preparation in many different tissue/cell
> line backgrounds would give a better understanding. It would address
> variations caused by chips and sample preps, as well as different sample
> background complexity. The last one is probably more biologically relevant.
> Does this make sense at all?
> What in your opinion is the best normalization method so far?
> Regards
> -h

currently at the National Yang-Ming University in Taipei, Taiwan
Laurent Gautier			CBS, Building 208, DTU
PhD. Student			DK-2800 Lyngby,Denmark	
tel: +45 45 25 24 89		http://www.cbs.dtu.dk/laurent
