[BioC] normalisation assumptions (violation of)

Henrik Bengtsson hb at maths.lth.se
Mon Aug 7 18:44:09 CEST 2006


On 8/7/06, J.delasHeras at ed.ac.uk <J.delasHeras at ed.ac.uk> wrote:
> Quoting Sean Davis <sdavis2 at mail.nih.gov>:
>
> [...]
> > You can certainly try loess and see how the result looks, as scatterplots
> > are notorious for "hiding" where the data are most dense.  Alternatively,
> > you could try "rotating" the scatterplot until the body of the data is where
> > you think it should be--I don't know if there is a method in Bioconductor
> > that does this, though.
> >
> > Sean
>
> Thanks Sean.
>
> I already tried loess, and the MA plot for the first set of
> data looks like this:
>
> http://mcnach.com/MISC/MAplots2.png
>
> which looks okay to me. You see the ascending diagonal is denser, which
> contains all those newly activated spots. I knew a few genes that were
> expected to be there (from RT data) and they line up nicely on that
> diagonal.

This MA plot indicates that the noise levels have become asymmetric
after curve-fit normalization.  I say so because your data is
"bending" upwards instead of forming a nice flat line, cf. Frame 33 of
48 in http://www.maths.lth.se/bioinformatics/calendar/20051108/.  If
this is true, your downstream tests might not work that well.

>
> This was without subtracting background.
> When I attempted to correct for background I ran into problems. Mainly
> because some slides have a higher bkg than usual, and the signal is
> lower than the local bkg for a good number of spots. When I use

You haven't told us your platform.  What type of scanner do you use?

> "subtract" as a bkg correction method, it results in many negative
> intensities, and those spots are removed. I then tried "half" to

I would say that this is expected for signals around zero (on the
intensity scale); if you have no biological signal there is a 50-50
chance that the background is stronger than the foreground.  The problem
is how to deal with those.  Also, do NOT be afraid of the large noise
levels at lower intensities; you do expect to see these when your
signals get closer to noise levels (closer to zero).  If you want to
stabilize the variance structure there are methods for this, but then
you pay the price of losing accuracy (you get biased log-ratio
estimates).
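
To make the 50-50 argument concrete, here is a minimal simulation
sketch in Python (standard library only).  The intensity level, the
noise standard deviation and the Gaussian noise model are assumptions
made purely for illustration; they are not taken from your data.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible


def fraction_negative(n_spots=10000, level=100.0, noise_sd=10.0):
    """Fraction of non-expressed spots whose foreground falls below background.

    Assumption: with no biological signal, foreground and local background
    are two noisy measurements of the same underlying level.
    """
    negative = 0
    for _ in range(n_spots):
        fg = random.gauss(level, noise_sd)  # foreground, no true signal
        bg = random.gauss(level, noise_sd)  # local background estimate
        if fg - bg < 0:
            negative += 1
    return negative / n_spots


frac = fraction_negative()
print(f"fraction of spots negative after subtraction: {frac:.3f}")
```

Under these assumptions roughly half of the non-expressed spots end up
negative after "subtract", which is why dropping them removes so much of
the low-intensity end.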

> overcome this, so that negative values are turned into an arbitrary
> 0.5... and this totally flattened the MA plot, and nothing was

Yes, 0.5 is very arbitrary.  Why not 5, 0.05, or 0.0000000000005?
You might want to look into Kooperberg's background correction
methods, or the ones in limma.
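
A small Python sketch of why the constant matters.  It uses a
simplified floor rule (anything below the floor is set to the floor) in
the spirit of the "half" option you describe; it is not limma's actual
code, and the spot intensities are made up.

```python
import math


def floored_log_ratio(red, green, floor):
    """log2(red/green) after raising every signal below `floor` up to `floor`."""
    r = max(red, floor)
    g = max(green, floor)
    return math.log2(r / g)


# Hypothetical background-subtracted spot: clear red signal, negative green.
red, green = 200.0, -3.0

ratios = {c: floored_log_ratio(red, green, c) for c in (5, 0.5, 0.05)}
for c, m in ratios.items():
    print(f"floor = {c:<5} M = {m:.2f}")
```

Each factor of 10 in the floor shifts M for such a spot by log2(10),
about 3.3 units, so the log-ratio is set by the chosen constant rather
than by the data.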

> statistically DE. I showed this on a previous thread:
>
> http://mcnach.com/MISC/MAplots1.png
>
> It's very striking. It leaves me no other choice but not removing
> background (which is increasingly looking like the best option in
> general, in my still short experience...)

Again, you haven't told us your platform: what scanner do you have?  You
might have an offset in your scanner (quite commonly added to prevent
negative analogue signals from being truncated to zero), e.g. Axon and
Agilent introduce about 20-25 units (which is significant).  With a
simple scan protocol it is easy to check whether your scanner introduces
an offset.  The method is described in

H. Bengtsson, G. Jönsson and J. Vallon-Christersson, Calibration and
assessment of channel-specific biases in microarray data with extended
dynamical range, BMC Bioinformatics, 2004, 5:177.

and the estimation and calibration methods are in aroma.light.  The
scanner offset is a global constant, which means that you only fit a
single parameter per channel.  That is, subtracting this "background"
from the foreground signals does not introduce as much noise as
subtracting the image-analysis estimated backgrounds, which are unique
to each spot.  This will leave you with fewer (probably no) non-positive
signals.  It might also be enough to remove the curvature seen in your
raw MA plots.  If so, your remaining problem will be how to estimate
the overall relative scale factor between the two channels, which is
only one parameter; it should be easier than using non-parametric
curve-fit methods.
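
As a rough illustration of the above, the following noise-free Python
sketch shows how a common additive offset bends the MA plot and how
subtracting it leaves only a single scale factor.  The offset (20), the
channel gains (2 and 1) and the "true" signals are hypothetical, and the
offset is assumed known here; estimating it from multi-scan data is what
the paper and aroma.light are for, and that is not reproduced.  The
median as the scale estimate is just one simple choice.

```python
import math
import statistics


def log_ratio(red, green):
    """M = log2(red / green)."""
    return math.log2(red / green)


# Hypothetical, noise-free setup: no differential expression at all.
offset = 20.0                       # assumed scanner offset, same in both channels
gain_red, gain_green = 2.0, 1.0     # assumed channel-specific scale factors
true_signal = [2.0 ** k for k in range(1, 16)]

red = [offset + gain_red * x for x in true_signal]
green = [offset + gain_green * x for x in true_signal]

# Raw log-ratios drift with intensity: this is the curvature in the MA plot.
m_raw = [log_ratio(r, g) for r, g in zip(red, green)]

# Subtract the single global offset from each channel.
m_cal = [log_ratio(r - offset, g - offset) for r, g in zip(red, green)]

# What remains is one relative scale factor; estimate it on the log scale.
log_scale = statistics.median(m_cal)
m_norm = [m - log_scale for m in m_cal]

print(f"raw M range:        {min(m_raw):.3f} to {max(m_raw):.3f}")
print(f"calibrated M range: {min(m_cal):.3f} to {max(m_cal):.3f}")
print(f"normalized M range: {min(m_norm):.3f} to {max(m_norm):.3f}")
```

In this toy setup the raw M values climb from near zero towards log2 of
the gain ratio as intensity increases, while the calibrated values are
flat, so no curve fit is needed.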

I would also like to encourage you to read up on what affine
transformations (offset plus rescaling) can do to your data and
especially your MA plots;

H. Bengtsson and O. Hössjer, Methodological study of affine
transformations of gene expression data with proposed robust
non-parametric multi-dimensional normalization method, BMC
Bioinformatics, 2006, 7:100.

When you understand the bits and pieces of what's going on there you
will also be much more careful when you pick your normalization
method.  I would say that curve-fit (loess, lowess, spline, ...)
normalization is often overkill and corrects for a symptom rather
than fixing the underlying problem.  Quantile normalization can be
interpreted as a non-parametric method that corrects for affine
transformations, but it has a problem at the lower and higher
intensities.  Variance stabilization methods (Rocke & Durbin, W Huber)
have an explicit affine component in their models, so they are much
more suited to this type of transform.  Plain affine normalization
(aroma.light) corrects for affine transformation without controlling
for variance (on purpose).  The estimation methods also differ
between the latter two approaches.
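
To illustrate the remark on quantile normalization, here is a bare-bones
Python sketch (no tie handling, standard library only) applied to two
channels that are different affine transformations of the same
hypothetical signal.  It is a generic textbook version, not the code of
any particular Bioconductor package, and all numbers are made up.

```python
def quantile_normalize(channels):
    """Force all channels onto the same empirical distribution.

    The target distribution is the mean of the sorted values across
    channels; each value is replaced by the target value of its rank.
    """
    n = len(channels[0])
    sorted_channels = [sorted(ch) for ch in channels]
    target = [sum(s[i] for s in sorted_channels) / len(channels)
              for i in range(n)]
    normalized = []
    for ch in channels:
        order = sorted(range(n), key=lambda i: ch[i])  # indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = target[rank]
        normalized.append(out)
    return normalized


# Hypothetical signal seen through two different affine transformations.
signal = [3.0, 50.0, 7.0, 400.0, 120.0]
red = [20.0 + 2.0 * x for x in signal]
green = [25.0 + 1.0 * x for x in signal]

red_n, green_n = quantile_normalize([red, green])
print(red_n)
print(green_n)
```

Because an affine transformation with positive scale preserves ranks,
the two normalized channels coincide spot by spot in this noise-free
example, without any offset or scale being estimated explicitly.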

I hope this is a good start.

Cheers

Henrik


> Jose
>
> --
> Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
> The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6513374
> Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
> Swann Building, Mayfield Road
> University of Edinburgh
> Edinburgh EH9 3JR
> UK
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>


