[BioC] different density

Henrik Bengtsson hb at stat.berkeley.edu
Wed Dec 19 21:35:09 CET 2007


On 19/12/2007, Naomi Altman <naomi at stat.psu.edu> wrote:
> A plot that is often quite informative is log(exprs) vs log(exprs)
> for the unnormalized probes from replicate arrays (or just log(pm) vs
> log(pm)) .  If the arrays are "good" the technical replicates have
> high correlation and are tightly clustered on the diagonal of this
> plot, and biological replicates are shaped more like an American
> football - not a bit more pointy at the extremes than an ellipse.
>
> Bad arrays are either much more scattered, do not show a diagonal
> trend or may be jammed into the upper or lower section of the plot.

I agree that saturation effects (and, in part, the amount of scatter)
can indicate a problem, but *absolutely not* such things as nonlinear
deviations from the diagonal on the logarithmic scale.

If you plot data in (log(y1), log(y2)) and see nonlinearities, that is
very often due to the simple fact that you have taken the logarithm of
signals that contain a bit of offset ("background").  If you instead
plot (y1,y2) you'll often find that the data lie on a nice straight
line.  The curvature comes from the fact that the line you can fit
through the data cloud does not pass through the origin (0,0).
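
The point above can be checked with a small noise-free sketch.  The
offset a=200, the scale b=1 and the range of signals are made-up values
chosen only for illustration; they do not come from any real array.

```python
import math

# Hypothetical, noise-free example: channel 2 is an affine function of
# channel 1, i.e. y2 = a + b * y1 with an additive offset a ("background").
a = 200.0   # assumed offset
b = 1.0     # assumed relative scale
y1 = [2.0 ** k for k in range(4, 17)]   # 16, 32, ..., 65536
y2 = [a + b * v for v in y1]

def local_slopes(u, v):
    """Slope between each pair of consecutive points (u[i], v[i])."""
    return [(v[i + 1] - v[i]) / (u[i + 1] - u[i]) for i in range(len(u) - 1)]

# On the intensity scale every local slope equals b: a straight line
# that misses the origin by the offset a.
slopes_raw = local_slopes(y1, y2)

# On the log scale the local slope changes with intensity: curvature.
slopes_log = local_slopes([math.log2(v) for v in y1],
                          [math.log2(v) for v in y2])

print("intensity scale:", min(slopes_raw), "to", max(slopes_raw))
print("log scale:      ", round(slopes_log[0], 3), "to", round(slopes_log[-1], 3))
```

The local slope is constant on the intensity scale but rises from near
0 towards 1 on the log scale, which is the curvature seen in the
(log(y1), log(y2)) plot.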

It is more common to discuss the above effects in a log-ratio versus
log-intensity plot, that is, rotate the data to (A,M) where
M=log2(y1/y2) and A=log2(y1*y2)/2.  Then the data "should be" along
M=0, but the offset and the logarithmic transform will make it bend
like a banana, roughly the same way as in (log(y1), log(y2)), just
rotated and rescaled.
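
As a rough sketch of the (A,M) transform and the banana shape, again
noise-free and with a made-up offset of 200 added to the second channel
only (the function name `to_am` is just a label for this example):

```python
import math

def to_am(y1, y2):
    """Return (A, M) with M = log2(y1/y2) and A = log2(y1*y2)/2."""
    return math.log2(y1 * y2) / 2, math.log2(y1 / y2)

signal = [2.0 ** k for k in range(4, 17)]   # hypothetical true signals
offset = 200.0                              # assumed offset in channel 2 only

# Without offset both channels are identical, so M = 0 everywhere.
flat = [to_am(x, x) for x in signal]

# With offset in one channel, M is far from 0 at low A and approaches 0
# at high A: the "banana".
banana = [to_am(x, x + offset) for x in signal]

for a_val, m_val in banana:
    print(f"A = {a_val:6.2f}   M = {m_val:6.3f}")
```

Since A = (log2(y1) + log2(y2))/2 and M = log2(y1) - log2(y2), this is
the (log(y1), log(y2)) plot rotated by 45 degrees and rescaled.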

Now to the apparent noise levels in log-ratios:  Even if you have no
banana shape in (A,M) you can still have offset in the data.  This
happens when the offset is effectively the same in both channels
(after taking differences in scale into account).  You can easily try
this yourself.  Take data that is nice and straight along the M=0 line
in an M vs A plot.  Then, go back to the intensity scale and add the
same offset 'a' (say a=500) to both channels, i.e. y1' <- y1 + a and
y2' <- y2 + a, and calculate M'=log2(y1'/y2') and A'=log2(y1'*y2')/2.
When you plot (A',M') the data is still straight and along M'=0.
However, we do know there is offset because we added it!  Even
"worse", if you look at the spread of {M'} compared with the spread of
{M}, you'll find that M' is much "cleaner": as you increase 'a' it
goes from being a "funnel", to an "American football", to a "lentil",
and finally it will be sucked up in a "black hole".
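
Here is a minimal simulation of that experiment.  Everything in it is
an assumption made for illustration: the signal range, the
multiplicative noise model, the seed, and the offsets beyond a=500.
The spread is measured as the root mean square of M around the
expected value M=0.

```python
import math
import random

rng = random.Random(1)   # fixed seed so the sketch is reproducible

# Simulated two-channel data without offset: the same true signal in
# both channels, each with independent multiplicative noise.
n = 2000
signal = [2.0 ** rng.uniform(5, 14) for _ in range(n)]
y1 = [x * 2.0 ** rng.gauss(0, 0.3) for x in signal]
y2 = [x * 2.0 ** rng.gauss(0, 0.3) for x in signal]

def log_ratios(u, v, a=0.0):
    """M' = log2((y1 + a) / (y2 + a)) for a common offset a."""
    return [math.log2((ui + a) / (vi + a)) for ui, vi in zip(u, v)]

def rms(values):
    """Root mean square, i.e. the spread around M = 0."""
    return math.sqrt(sum(v * v for v in values) / len(values))

offsets = [0.0, 500.0, 5000.0, 50000.0]
spread = {a: rms(log_ratios(y1, y2, a)) for a in offsets}
center = {a: sum(log_ratios(y1, y2, a)) / n for a in offsets}

for a in offsets:
    print(f"a = {a:7.0f}   mean(M') = {center[a]: .4f}   spread(M') = {spread[a]:.4f}")
```

The log-ratios stay centred on zero for every offset while their
spread keeps shrinking as 'a' grows, even though nothing about the
underlying measurements has improved.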

In other words, evaluating quality by looking at the variance in M is
dangerous and deceptive if you're not careful.  If you think about
it, in a perfect world without offset but with noise, your
log-ratios will/should have infinite variance for signals close to
zero, e.g. "log2(0/0)".  (How to deal with this fact is a different
issue.)

To summarize, don't throw out samples/arrays just because their
(log(y1),log(y2)) or (A,M) plots look like a banana, or because their
log-ratios (M) blow up at lower log-intensities (A).  Such effects can
be fixed by using the *correct* calibration/normalization.  Microarray
experiments still cost money and RNA/DNA might be scarce.
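
To illustrate the idea only: the sketch below estimates the offset and
scale with an ordinary least-squares line fit on the intensity scale
and removes them before taking logs.  This is a simplified stand-in,
not the robust multi-dimensional method of the paper cited below, and
the data are noise-free with made-up values (offset 200, scale 1.5).

```python
import math

def fit_line(u, v):
    """Ordinary least-squares fit v = intercept + slope * u."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    sxy = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    slope = sxy / sxx
    return mean_v - slope * mean_u, slope

# Hypothetical noise-free data: channel 2 has an offset and a scale factor.
y1 = [2.0 ** k for k in range(4, 17)]
y2 = [200.0 + 1.5 * v for v in y1]

# Fit on the intensity scale (not the log scale), then undo the
# estimated affine transform in channel 2.
a_hat, b_hat = fit_line(y1, y2)
y2_cal = [(v - a_hat) / b_hat for v in y2]

m_before = [math.log2(p / q) for p, q in zip(y1, y2)]
m_after = [math.log2(p / q) for p, q in zip(y1, y2_cal)]

print("M before calibration:", round(min(m_before), 3), "to", round(max(m_before), 3))
print("M after calibration: ", round(min(m_after), 3), "to", round(max(m_after), 3))
```

With real, noisy data a plain least-squares fit like this would be
sensitive to outliers and to truly differentially expressed genes, so
a robust estimator is the safer choice.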

In order to stop myself from ranting more about this here, please read
the following instead:

H. Bengtsson and O. Hössjer, Methodological study of affine
transformations of gene expression data with proposed robust
non-parametric multi-dimensional normalization method, BMC
Bioinformatics, 2006, 7:100.
http://www.biomedcentral.com/1471-2105/7/100/

(It has references to other papers that also deal with this problem,
although they are less explicit about it.)

Cheers

Henrik

>
> --Naomi
>
>
> At 05:30 PM 12/18/2007, Jakub Mieczkowski wrote:
> >First of all, thank you very much for the response.
> >Unfortunately I don't understand what you mean when you say I should
> >look closely. I've got only .CEL files and I have no idea what else I
> >can do.
> >QCReport is available here:
> >
> >http://students.mimuw.edu.pl/%7Ejm214641/AffyQCReport.pdf
> >
> >On the RLE and RNAdeg plots I can't distinguish the 4 "outliers" from
> >the rest.
> >
> >How can I check what was measured (background or signal)? Should I use
> >the P/M/A method or something different? Are there any other Quality
> >Control methods than QCReport, RLE, NUSE and image analysis (residuals,
> >weights)? Maybe, in this situation, some pre-processing methods are
> >better than others? Maybe a linear transformation can help?
> >Thank You,
> >Kuba
> >
> >Sean Davis wrote:
> > >
> > >
> > > On Dec 17, 2007 5:28 PM, Jakub Mieczkowski <kubamieczkowski at op.pl>
> > > wrote:
> > >
> > >     Hi All,
> > >     I'm new to Bioconductor and I want to analyse time course data (6 time
> > >     points, 3 oligo arrays in each). During the quality control (QCReport)
> > >     I found that 4 arrays have different densities, as shown here:
> > >
> > >     http://students.mimuw.edu.pl/~jm214641/BoxANDden.pdf
> > >
> > >     The NUSE plot shows differences too. The images of weights are a little
> > >     bit different from the rest, but I can't notice any artefacts.
> > >     3 of them, are from the same time point.
> > >
> > >     Should I remove them from further analysis (differences can have
> > >     biological basis)? Or maybe I just can't use methods like RMA (because
> > >     of different distributions)? Do you have any suggestions?
> > >
> > >
> > > Hi, Kuba.  You will probably need to look closely at the QC information
> > > on these arrays, but I would be concerned that these arrays didn't work
> > > for one reason or another given the much lower intensities associated
> > > with your four "outlier arrays".  I do not think I would blindly apply
> > > RMA to those arrays without getting a better sense of whether or not
> > > they are measuring something and not just representing mostly background
> > > signal.
> > >
> > > Sean
> > >
> > >
> >
> >_______________________________________________
> >Bioconductor mailing list
> >Bioconductor at stat.math.ethz.ch
> >https://stat.ethz.ch/mailman/listinfo/bioconductor
> >Search the archives:
> >http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> Naomi S. Altman                                814-865-3791 (voice)
> Associate Professor
> Dept. of Statistics                              814-863-7114 (fax)
> Penn State University                         814-865-1348 (Statistics)
> University Park, PA 16802-2111
>


