[BioC] Re: quantile normalization vs. data distributions

Leslie Cope cope at jhu.edu
Tue Mar 16 18:58:44 MET 2004


The tests already take sample size into account, which is part of the
problem. If two datasets really come from the exact same distribution,
then as sample size increases, histograms, density plots, summary
statistics and so on will get closer and closer to one another. The
tests take this into account, so with large samples even a small
discrepancy is reported as significant.
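
To make the sample-size effect concrete, here is a rough sketch in
Python (standard library only). It is an illustration under stated
assumptions, not anything from the thread: the function names, the 0.05
SD shift between the two simulated "chips" and the sample sizes are all
invented for the example, and the p-value uses the large-sample
Kolmogorov series rather than an exact calculation.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the ECDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every observation equal to x in both samples.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov series (an approximation)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here; the true value is ~1
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def compare(n, shift=0.05, seed=0):
    """Draw two normal samples whose means differ by `shift` (hypothetical)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(shift, 1.0) for _ in range(n)]
    d = ks_statistic(a, b)
    return d, ks_pvalue(d, n, n)


if __name__ == "__main__":
    # Same underlying difference each time; only the sample size changes.
    for n in (100, 10_000, 100_000):
        d, p = compare(n)
        print(f"n={n:>7}  D={d:.4f}  p={p:.3g}")
```

The difference between the two distributions is fixed throughout, so
any drop in the p-value as n grows comes from the sample size alone.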

This becomes a problem in our case because we know that even with the
large number of genes on a chip, there are differences in distribution
from chip to chip. Some of these differences don't matter for quantile
normalization. For example, a simple difference in means would obviously
not be a problem for quantile normalization. Nor would a simple
difference in variance. These and more complicated differences between
distributions can be accounted for when building tests, but the standard
tests themselves are blind and can't tell distributional differences we
care about from those we don't.
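
As a small illustration of why a pure shift in mean or change in
variance is harmless, here is a minimal quantile normalization sketch in
Python (standard library only). The function names and the simulated
chips are assumptions made up for this example, ties are not averaged,
and this is not meant to reproduce any particular Bioconductor
implementation.

```python
import random


def quantile_normalize(chips):
    """Quantile-normalize equal-length lists of intensities, one per chip."""
    n = len(chips[0])
    # Reference distribution: mean of the k-th smallest value across chips.
    sorted_chips = [sorted(c) for c in chips]
    reference = [sum(s[k] for s in sorted_chips) / len(chips) for k in range(n)]
    normalized = []
    for chip in chips:
        # Gene indices ordered by rising intensity on this chip.
        order = sorted(range(n), key=lambda g: chip[g])
        out = [0.0] * n
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]  # each gene gets the value for its rank
        normalized.append(out)
    return normalized


def simulate(n_genes=1000, seed=1):
    """Two hypothetical chips measuring the same signal on different scales."""
    rng = random.Random(seed)
    chip_a = [rng.gauss(8.0, 1.0) for _ in range(n_genes)]
    # Same signal with a shifted mean and an inflated variance.
    chip_b = [2.0 + 1.5 * x for x in chip_a]
    return chip_a, chip_b


if __name__ == "__main__":
    chip_a, chip_b = simulate()
    norm_a, norm_b = quantile_normalize([chip_a, chip_b])
    worst = max(abs(x - y) for x, y in zip(norm_a, norm_b))
    print("largest per-gene difference after normalization:", worst)
```

Because the second chip is an increasing linear transformation of the
first, every gene keeps its rank, and the normalization assigns values
by rank only. A standard two-sample test on the raw chips would flag
them as different all the same.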

And for that matter, it is evident from recent discussion in this forum
that no one is sure which differences we should care about and which
don't matter. Trying to figure that out is the whole point of this
thread. Because of that, I suspect that you will not get a nice clean
answer to your first question at this time.

Leslie Cope, Ph.D.
Oncology Biostatistics, JHU


> 2. As a non-statistician I'm a bit confused that a statistical test
> will nearly always find a significant difference between distributions
> when the samples are large (I remember someone mentioned this to me -
> without explanations - about 2 years ago in a posting to the R-list).
> Is there a way to "normalize" the test results (e.g. the p-values) by
> the size of the sample?
>
> I guess such a significant difference as reported by a test is a *real*
> difference (otherwise all statistical tests would be worthless ...).
> Can one assume that, even if the two distributions are statistically
> different, one can treat them as equal judged by visual investigation
> of a density plot or histogram?


