[BioC] Limma: background correction. Use or ignore?

Wed Apr 5 15:38:34 CEST 2006

Hi,

I'm jumping into this thread here.  I will try to comment on most of
the message, so there will be more following.  People will probably
disagree with me in some cases.

On 3/31/06, James W. MacDonald <jmacdon at med.umich.edu> wrote:
> Hi Jose,
>
> J.delasHeras at ed.ac.uk wrote:
> > I have been using LimmaGUI for a while to analyse my cDNA microarrays.
> > I have always used "subtract" as a method for background correction.
> > Why? Not sure. Intuitively it made sense, and I didn't observe any
> > obvious problems.
> > Once I played with the different methods for background correction
> > available in LimmaGUI, and when looking at the MA plots I decided I
> > preferred to subtract.
> >
> > However, I have recently had problems with the statistics being quite
> > poor in my analyses (see my post a week ago or so about low B
> > values)... and whilst checking the data, I noticed that at least in my
> > current experiments, if I do no background correction at all the stats
> > look a lot better, the MA plots look better, and everything looks
> > better in general. The actual list of genes doesn't change a lot, but
> > the values seem a lot tighter.

A leading question: What do you mean by "MA plots look better"?  They
can look better in many ways, depending on what question you are
trying to answer.  To simplify greatly, there are two kinds of questions:

1) Find differentially expressed genes, that is, we are trying to test
the null hypothesis H0: mu=0, against H1: mu != 0, where mu is the
unknown log-ratio of the gene (in two samples).

2) Estimate the unknown log-ratio of the gene (in two samples), i.e.
estimate mu.  This may for instance be of interest in copy number
analysis.

In Case 1, it does not matter much whether the *absolute* values of
our mu estimates are biased - we are still trying to identify those
away from zero.  In other words, if we rescale the estimates we will,
in theory, still be able to identify differentially expressed genes.
This is what the variance stabilizing (VS) methods (Huber; Rocke &
Durbin) make use of.
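
A minimal sketch of the rescaling argument, under simplifying
assumptions: a plain linear rescaling stands in for the (nonlinear) VS
transforms, an ordinary one-sample t-statistic stands in for whatever
test is actually used, and the replicate log-ratios are made-up
numbers.  It only illustrates that shrinking every estimate by a
constant leaves the test statistic for H0: mu=0 unchanged.

```python
import math
import statistics


def t_statistic(values):
    """One-sample t-statistic for H0: mean == 0."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return mean / (sd / math.sqrt(n))


# Hypothetical replicate log-ratios for one gene (illustrative only).
log_ratios = [0.8, 1.1, 0.9, 1.3]

# The same estimates, biased towards zero by a constant factor.
shrunk = [0.5 * m for m in log_ratios]

t_original = t_statistic(log_ratios)
t_shrunk = t_statistic(shrunk)

# The estimates differ by a factor of two, the test statistic does not.
print(t_original, t_shrunk)
```

The estimate of mu itself is halved, which is exactly what matters in
Case 2.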

In Case 2, the unbiased estimates are by definition the quantities of
interest.  For this reason we cannot use, for instance, VS methods in
this case.  An exception is if you change your objective to identifying,
say, genes with copy numbers 0, 1, 2, 3, ...; then we have a
classification problem, and VS methods may still be valid.

> > This makes me question whether we should background correct at all. My
> > slides are pretty clean, low background. Am I not adding more noise to
> > the data by removing background?
>
> I have never been a big fan of subtracting background, especially if the
> background of the slide is low and relatively consistent. I have two
> main reasons for this.
>
> First, the portion of the slide used to estimate background doesn't have
> any cDNA bound, so you are estimating the background binding of the spot
> by using a portion of the slide that might not be very similar. When we
> were doing more spotted arrays, we would always spot unrelated cDNA on
> the slides as well (e.g., A.thaliana and salmon sperm DNA). These spots
> almost always had a negative intensity if you subtracted the local
> background, which indicates to me that cDNA does a better job of
> blocking the slide than BSA or other blocking agents.
>
> Second, you *are* adding more noise to the data. When you subtract, the
> variances are additive. However, if you don't subtract then you take the
> chance that you are biasing your expression values, especially if the
> background from chip to chip isn't relatively consistent. So the
> tradeoff is higher variance vs possible bias. If the background was
> consistent I usually took a chance on the bias in order to reduce the
> variance. As you note, the data usually look 'cleaner' if you don't
> adjust the background.

I totally agree with you, it [the log-ratio log-intensity scatter
plot] "looks" cleaner, but it does not necessarily mean it is better. 
Especially if one deals with Case 2 above, I normally say that, if you
do not see large variance in log-ratios at lower intensities, you are
doing something wrong.  This is of course not the full story - it
depends on what methods you use downstream.  However, I don't really
trust someone who compares two log-ratio log-intensity plots, points
at one of them, and says "I used this one because there is less
spread".

Hopefully not being too self-oriented, I would like to refer to
Bengtsson & Hössjer, Methodological study of affine transformations
of gene expression data with proposed robust non-parametric
multi-dimensional normalization method, BMC Bioinformatics, 2006, for
more details.  I also have quite a few talks on the topic at
http://www.maths.lth.se/bioinformatics/.  The VS papers address this
too, but much less explicitly.

> Note that these points are directed towards simple subtraction of a
> local background estimate. Other more sophisticated methods may help
> address these shortcomings.

It is important to differentiate between true background and
background methods.  It is even more important to differentiate
between all types of background that can be introduced in the
microarray process.  It can be introduced at many places, e.g.
labelling, cross hybridization, dust, scanning, image analysis and so
on.  There is no single method that addresses all of them, and that is
important to understand/accept.

For instance, the paper Yang et al, Comparison of methods for image
analysis on cDNA microarray data, J Comput Graph Stat, 2002, shows that
different image-analysis methods estimate background differently.
Thus, when we choose a method, we introduce a bias (unless we are lucky
enough to hit the right one).  Similar conclusions can be drawn from
Bengtsson & Bengtsson, Microarray image analysis: background
estimation using quantile and morphological filters, BMC Bioinformatics,
2006.

Another example is scanner bias.  We found that both Axon and Agilent
scanners introduce a substantial offset in signals.  See Bengtsson et
al, Calibration and assessment of channel-specific biases in
microarray data with extended dynamical range, BMCBioinfo, 2004.  The
offset in both scanners was/is about 20 units on the range [0, 65535].
It does not sound like much, but 20 is definitely enough to bias your
log-ratios.  We have seen similar effects in Affymetrix scanners.
Afterwards, we have identified some models of the same brands that
do not have such a strong offset.  Thus, when we choose a scanner we
introduce bias.  I'll explain in another message how to estimate and
correct for this. It is easy.
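
To see why 20 units is enough, here is a small worked example.  It is
only a sketch: the signal pairs are hypothetical, and it assumes the
same offset of 20 is added to both channels and nothing else distorts
the signal.

```python
import math


def log_ratio(red, green, offset=0.0):
    """log2 ratio of two channel signals after a common additive offset."""
    return math.log2((red + offset) / (green + offset))


OFFSET = 20.0  # approximate scanner offset quoted in the message

# Hypothetical true (red, green) signals, all with a true 2-fold change,
# i.e. a true log2 ratio of exactly 1.
signals = [(100.0, 50.0), (1000.0, 500.0), (20000.0, 10000.0)]

true_values = [log_ratio(r, g) for r, g in signals]
observed = [log_ratio(r, g, OFFSET) for r, g in signals]

for (r, g), m in zip(signals, observed):
    print(f"red={r:8.0f} green={g:8.0f} observed log2 ratio={m:.3f}")
```

The weakest pair comes out at roughly 0.78 instead of 1, while the
strongest is almost unaffected, so the bias is intensity dependent and
pulls the log-ratios of weak spots towards zero.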

We can of course argue that classical image-analysis background
correction methods correct for scanner bias too, i.e.

  y_fg = y + y_scanner + eps   and   y_bg = y_scanner + xi
  => y_est = y_fg - y_bg = y + eps',  where eps' = eps - xi.

However, xi will probably introduce not only unnecessary variance but
also bias.  See the Bengtsson & Bengtsson paper for the latter.

If we believe that features on the arrays can be contaminated by
unwanted fluorescent molecules, then we have another source of
background, and so on.

Finally, consider the following rhetorical questions.  If we accept
that there are scanner offsets, which I believe we have been able to
prove in the above paper, then it is very hard to argue that you should
not correct for background.  If we still argue that we should not
subtract, then what if we are using two different scanner
brands/models for the same array and they introduce different offsets?
Then we get into a contradiction.  In this way one can argue that it is
very strange if we do not need to correct for additive background,
whatever origin it has.

/Henrik

> As for references, have you looked at the references that Gordon gives
> on the man page for backgroundCorrect()? That would probably be a good
> place to start.
>
> Best,
>
> Jim
>
>
> >
> > Can anybody point me to a good reference to learn about the effects of
> > background correction, pros and cons? I'm just a molecular biologist,
> > not a statistician, but I need to understand a bit better these issues
> > or there'll be no molecular biology to work on from my experiments!
> >
> > Jose
> >
> >
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Affymetrix and cDNA Microarray Core
> University of Michigan Cancer Center
> 1500 E. Medical Center Drive
> 7410 CCGC
> Ann Arbor MI 48109
> 734-647-5623
>