[BioC] different density

Wed Dec 19 23:02:46 CET 2007

Hi folks,
I did not mean that one should look at
nonlinearity of the main trend.  If the RNA is
really bad, either the scatter fills the
entire rectangle or the data from one array are
jammed up against the lower or upper boundary of
the plot.  Curvature should be fixable by normalization.

Sorry for the misunderstanding.

--Naomi

At 03:35 PM 12/19/2007, Henrik Bengtsson wrote:
>On 19/12/2007, Naomi Altman <naomi at stat.psu.edu> wrote:
> > A plot that is often quite informative is log(exprs) vs log(exprs)
> > for the unnormalized probes from replicate arrays (or just log(pm) vs
> > log(pm)) .  If the arrays are "good" the technical replicates have
> > high correlation and are tightly clustered on the diagonal of this
> > plot, and biological replicates are shaped more like an American
> > football - not a bit more pointy at the extremes than an ellipse.
> >
> > Bad arrays are either much more scattered, do not show a diagonal
> > trend or may be jammed into the upper or lower section of the plot.
>
>I can agree with saturation effects (and partly amount of scatter),
>but *absolutely not* such things as non-linear discrepancies away from
>diagonal on the logarithmic scale.
>
>If you plot data in (log(y1), log(y2)) and see nonlinearities, that is
>very often due to the simple fact that you have taken the logarithmic
>transform on signals that have a bit of offset ("background").  If you
>instead plot (y1,y2) you'll often find that the data lie on a nice
>straight line.  The curvature comes from the fact that the line you
>can fit through the data cloud does not pass through the origin (0,0).
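
The point about offset and curvature can be checked numerically. The following is a minimal sketch in Python, not code from the thread: the offsets a1 and a2 and the helper loglog_slope are hypothetical, chosen only for illustration. Both channels measure the same true signal x, so (y1, y2) lies on a straight line with slope 1 that misses the origin, yet the slope in (log2(y1), log2(y2)) changes with intensity.

```python
import math

# Hypothetical offsets ("background") for the two channels; not from the thread.
a1, a2 = 50.0, 500.0


def loglog_slope(x_lo, x_hi, a1, a2):
    """Chord slope of the curve (log2(y1), log2(y2)) between two true signals.

    On the raw scale y1 = x + a1 and y2 = x + a2, so y2 = y1 + (a2 - a1):
    a straight line with slope 1 that does not pass through (0, 0).
    """
    d_log_y1 = math.log2(x_hi + a1) - math.log2(x_lo + a1)
    d_log_y2 = math.log2(x_hi + a2) - math.log2(x_lo + a2)
    return d_log_y2 / d_log_y1


slope_low = loglog_slope(10, 100, a1, a2)           # weak signals
slope_high = loglog_slope(10_000, 100_000, a1, a2)  # strong signals

print(f"log-log slope at low intensity:  {slope_low:.3f}")
print(f"log-log slope at high intensity: {slope_high:.3f}")
```

The slope is far below 1 where the offset dominates and approaches 1 where the signal dominates; that change in slope is the curvature seen on the log scale.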
>
>It is more common to discuss the above effects in a log-ratio
>log-intensity plot, that is, rotate the data to (A,M) where
>M=log2(y1/y2) and A=log2(y1*y2)/2.  Then data "should be" along M=0,
>but the offset and the logarithmic transform will make it bend like a
>banana.  Roughly the same way as in (log(y1), log(y2)) just rotated
>and rescaled.
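
The (A,M) rotation and the resulting "banana" can be sketched in the same way. This is an illustrative Python sketch under assumed values (the offsets a1, a2 and the list of true signals are hypothetical), using the definitions M=log2(y1/y2) and A=log2(y1*y2)/2 given above.

```python
import math


def to_ma(y1, y2):
    """Rotate a pair of intensities to (A, M) as defined in the message."""
    m = math.log2(y1 / y2)      # log-ratio
    a = math.log2(y1 * y2) / 2  # average log-intensity
    return a, m


# Hypothetical unequal offsets added to the same true signal in both channels.
a1, a2 = 50.0, 500.0
signals = [10, 100, 1_000, 10_000, 100_000]

curve = [to_ma(x + a1, x + a2) for x in signals]
for a, m in curve:
    print(f"A = {a:6.2f}   M = {m:6.2f}")
```

Without offset every point would sit at M=0. With unequal offsets, M is strongly negative at low A and bends back toward 0 as A grows.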
>
>Now to the apparent noise levels in log-ratios:  Even if you have no
>banana shape in (A,M) you can still have offset in the data.  This
>happens when the offset is effectively the same (when taking into
>account differences in scales).  You can easily try this yourself.
>Take data that is nice and straight along the M=0 line in an M vs A
>plot.  Then, go back to the intensity scale and add the same offset
>'a' (say a=500) to both channels, i.e. y1' <- y1 + a and y2' <- y2 +
>a, and calculate M'=log2(y1'/y2') and A'=log2(y1'*y2')/2.  When you
>plot (A',M') the data is still straight and along M'=0.  However, we
>do know there is offset because we added it!  Even "worse", if you
>look at the spread of {M'} compared with the spread of {M}, you'll
>find that M' is much "cleaner" - as you increase 'a' it goes from
>being a "funnel", to an "American football", to a "lentil", and finally
>it will be sucked up in a "black hole".
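
The experiment suggested above is easy to try. Below is a minimal Python sketch of it on simulated data; the simulation itself (signal range, multiplicative noise level, random seed, and the list of offsets) is an assumption made for illustration, not something taken from the thread.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible


def log_ratios(pairs, a=0.0):
    """M' = log2((y1 + a) / (y2 + a)) for every pair of intensities."""
    return [math.log2((y1 + a) / (y2 + a)) for y1, y2 in pairs]


# Simulated two-channel data that lies along M = 0 (assumed noise model:
# the same true signal in both channels with multiplicative noise).
pairs = []
for _ in range(2000):
    x = 2 ** random.uniform(6, 16)
    y1 = x * 2 ** random.gauss(0, 0.3)
    y2 = x * 2 ** random.gauss(0, 0.3)
    pairs.append((y1, y2))

offsets = [0, 500, 5_000, 50_000]  # the 'a' added to both channels
spreads = [statistics.pstdev(log_ratios(pairs, a)) for a in offsets]
centers = [statistics.mean(log_ratios(pairs, a)) for a in offsets]

for a, c, s in zip(offsets, centers, spreads):
    print(f"a = {a:6d}   mean(M') = {c:+.3f}   sd(M') = {s:.3f}")
```

The log-ratios stay centred on M'=0 for every offset while their spread keeps shrinking, so a smaller variance in M here reflects only the added offset and not better data.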
>
>In other words, evaluating quality by looking at the variance in M is
>dangerous and deceptive, if you're not careful.  If you think about
>it, in the perfect world without offset but with noise, your
>log-ratios will/should have infinite variance for signals close to
>zero, e.g. "log2(0/0)".  (How to deal with this fact is a different
>issue).
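
The blow-up of log-ratios near zero can also be illustrated by simulation. This Python sketch assumes additive Gaussian noise with a hypothetical standard deviation of 20 and no offset; the helper m_values and all numeric choices are illustrative assumptions rather than anything specified in the thread.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible


def m_values(x, noise_sd=20.0, n=5000):
    """Simulate M = log2(y1/y2) for true signal x with additive noise, no offset.

    Returns the defined log-ratios and the number of pairs for which the
    log-ratio is undefined because a noisy intensity was not positive.
    """
    ms, undefined = [], 0
    for _ in range(n):
        y1 = x + random.gauss(0, noise_sd)
        y2 = x + random.gauss(0, noise_sd)
        if y1 > 0 and y2 > 0:
            ms.append(math.log2(y1 / y2))
        else:
            undefined += 1
    return ms, undefined


m_low, undefined_low = m_values(10)       # signal close to zero
m_high, undefined_high = m_values(5_000)  # signal well above the noise

spread_low = statistics.pstdev(m_low)
spread_high = statistics.pstdev(m_high)

print(f"x = 10:   sd(M) = {spread_low:.3f}, undefined pairs = {undefined_low}")
print(f"x = 5000: sd(M) = {spread_high:.3f}, undefined pairs = {undefined_high}")
```

Near zero many log-ratios cannot be computed at all, and those that can are far more variable than at high intensity.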
>
>To summarize, don't throw out samples/arrays just because their
>(log(y1),log(y2)) or (A,M) plots look like a banana, or if their
>log-ratios (M) blow up at lower log-intensities (A).  Such effects can
>be fixed by using the *correct* calibration/normalization.  Microarray
>experiments still cost money and RNA/DNA might be scarce.
>
>In order to stop myself from ranting more about this here, please read
>the following instead:
>
>H. Bengtsson and O. Hössjer, Methodological study of affine
>transformations of gene expression data with proposed robust
>non-parametric multi-dimensional normalization method, BMC
>Bioinformatics, 2006, 7:100.
>http://www.biomedcentral.com/1471-2105/7/100/
>
>(It has references to other papers also dealing with this problem,
>although they are less explicit about it)
>
>Cheers
>
>Henrik
>
> >
> > --Naomi
> >
> >
> > At 05:30 PM 12/18/2007, Jakub Mieczkowski wrote:
> > >First of all thank you very much for response.
> > >Unfortunately I don't understand what you mean by saying that I should
> > >look closely. I've got only .CEL files and I have no idea what else I can do.
> > >QCReport is available here:
> > >
> > >http://students.mimuw.edu.pl/%7Ejm214641/AffyQCReport.pdf
> > >
> > >On RLE and RNAdeg plots I can't distinguish 4 "outliers" from rest.
> > >
> > >How can I check what was measured (background or signal)? Should I use
> > >P/M/A method or something different? Are there any other Quality Control
> > >methods than QCReport, RLE, NUSE and image analysis (residuals,
> > >weights)? Maybe, in this situation, some pre-processing methods are
> > >better than others? Maybe a linear transformation can help?
> > >Thank You,
> > >Kuba
> > >
> > >Sean Davis wrote:
> > > >
> > > >
> > > > On Dec 17, 2007 5:28 PM, Jakub Mieczkowski <kubamieczkowski at op.pl>
> > > > wrote:
> > > >
> > > >     Hi All,
> > > >     I'm new to Bioconductor and I want to analyse time course data
> > > >     (6 time points, 3 oligo arrays in each). During the quality
> > > >     control (QCReport) I found that 4 arrays have different
> > > >     densities, as shown here:
> > > >
> > > >     http://students.mimuw.edu.pl/~jm214641/BoxANDden.pdf
> > > >
> > > >     The NUSE plot shows differences too. The images of weights are a
> > > >     little bit different from the rest, but I can't see any artefacts.
> > > >     3 of them are from the same time point.
> > > >
> > > >     Should I remove them from further analysis (the differences could
> > > >     have a biological basis)? Or maybe I just can't use methods like
> > > >     RMA (because of the different distributions)? Do you have any
> > > >     suggestions?
> > > >
> > > >
> > > > Hi, Kuba.  You will probably need to look closely at the QC information
> > > > on these arrays, but I would be concerned that these arrays didn't work
> > > > for one reason or another given the much lower intensities associated
> > > > with your four "outlier arrays".  I do not think I would blindly apply
> > > > RMA to those arrays without getting a better sense of whether or not
> > > > they are measuring something and not just representing mostly
> > > > background signal.
> > > >
> > > > Sean
> > > >
> > > >
> > >
> > >_______________________________________________
> > >Bioconductor mailing list
> > >Bioconductor at stat.math.ethz.ch
> > >https://stat.ethz.ch/mailman/listinfo/bioconductor
> > >Search the archives:
> > >http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> > Naomi S. Altman                                814-865-3791 (voice)
> > Associate Professor
> > Dept. of Statistics                              814-863-7114 (fax)
> > Penn State University                         814-865-1348 (Statistics)
> > University Park, PA 16802-2111
> >

Naomi S. Altman                                814-865-3791 (voice)
Associate Professor
Dept. of Statistics                              814-863-7114 (fax)
Penn State University                         814-865-1348 (Statistics)
University Park, PA 16802-2111