[BioC] distribution of agilent array data.

Mon Nov 11 19:04:31 CET 2013

Hello Chunxuan,

I have worked with Agilent arrays a lot and can confirm Jim's comment that the type of distribution you show (with a heavy right tail) is fairly typical. If you follow Jim's advice and use smaller bin sizes/more bins (nclass=100 or so), you will probably see that there is some mass of the distribution to the left of the peak/mode (as you would expect from normexp).
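
To illustrate the point about bin sizes, here is a minimal sketch in base R. It does not use your data or the normexp model itself; it is a toy mixture on the log2 scale, loosely following Jim's description (many unexpressed probes scattered symmetrically around a low value, plus expressed probes with a long right tail). All parameter values (the 70/30 split, the centre at 6, the spreads) are assumptions made up for illustration.

```r
set.seed(1)

# Hypothetical toy data on the log2 scale; every number here is an assumption.
n_probes  <- 43000                      # roughly the probe count mentioned in the thread
expressed <- runif(n_probes) < 0.3      # assume about 30% of probes are expressed
log_int   <- ifelse(expressed,
                    6 + rexp(n_probes, rate = 1),          # long right tail
                    rnorm(n_probes, mean = 6, sd = 0.4))   # symmetric around a low value

# Same data, two histograms: default (few wide bins) versus nclass = 100.
h_coarse <- hist(log_int, plot = FALSE)
h_fine   <- hist(log_int, nclass = 100, plot = FALSE)

# Locate the modal bin of the fine histogram and the mass on either side of it.
mode_mid   <- h_fine$mids[which.max(h_fine$counts)]
left_mass  <- mean(log_int < mode_mid)
right_mass <- mean(log_int > mode_mid)

cat("bins (default):", length(h_coarse$counts), "\n")
cat("bins (nclass = 100):", length(h_fine$counts), "\n")
cat("fraction left of mode:", round(left_mass, 2), "\n")
cat("fraction right of mode:", round(right_mass, 2), "\n")
```

With the wide default bins the left flank of the peak can be swallowed by one or two bars, which makes the distribution look truncated; with narrow bins the mass to the left of the mode becomes visible. Replacing `plot = FALSE` with the default draws the two pictures.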

I guess what might be confusing you is that normalisation plus log transformation is supposed to give you data that are normally distributed (or at least not very far from it), and hence symmetrically distributed. But this is a statement about the distribution of the replicates WITHIN genes, not ACROSS genes.
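
The within-gene versus across-gene distinction can be sketched with simulated data. This is only an illustration under assumed values: gene-level means are drawn from a skewed toy mixture, and replicate noise within each gene is drawn from a normal distribution. The helper `skewness` is defined here for the example and is not a limma function.

```r
set.seed(2)

# Simple moment-based sample skewness (helper written for this example).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

n_genes  <- 2000   # hypothetical sizes, chosen only to keep the example fast
n_arrays <- 20

# Assumed gene-level means on the log2 scale: skewed across genes.
expressed <- runif(n_genes) < 0.3
gene_mean <- ifelse(expressed,
                    6 + rexp(n_genes, rate = 1),
                    rnorm(n_genes, mean = 6, sd = 0.4))

# Assumed replicate noise: normal within each gene.
noise <- matrix(rnorm(n_genes * n_arrays, sd = 0.3), nrow = n_genes)
expr  <- gene_mean + noise   # gene_mean is recycled down the columns, one mean per row

# Across genes: all probes of one array.
skew_across <- skewness(expr[, 1])

# Within genes: replicates centred on their own gene mean, pooled over genes.
within      <- expr - rowMeans(expr)
skew_within <- skewness(as.vector(within))

cat("skewness across genes (array 1):", round(skew_across, 2), "\n")
cat("skewness within genes (pooled): ", round(skew_within, 2), "\n")
```

In this toy setting the histogram of a single array is clearly right-skewed, while the gene-centred replicates are close to symmetric, which is all that the normality assumption needs.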

Best Wishes

Claus

Dr. Claus-D. Mayer
Biomathematics & Statistics Scotland (BioSS)
Rowett Institute of Nutrition and Health
University of Aberdeen
Aberdeen AB21 9SB, Scotland, UK.
email: claus at bioss.ac.uk or c.mayer at abdn.ac.uk
Telephone: +44 (0) 1224 438652

Biomathematics and Statistics Scotland (BioSS) is formally part of The James Hutton Institute,
a  registered Scottish charity No. SC041796 and a company limited by guarantee No. SC374831

> -----Original Message-----
> From: bioconductor-bounces at r-project.org
> [mailto:bioconductor-bounces at r-project.org] On Behalf Of shao chunxuan
> Sent: 11 November 2013 15:52
> To: James W. MacDonald
> Cc: bioconductor
> Subject: Re: [BioC] distribution of agilent array data.
>
> Hi Jim,
> Thanks for the helpful comments.
> I ask partly because the data have also been normalized by other
> people, and in their version the distribution is more or less symmetric.
> The alternative code is:
>
> library(limma)
> targets <- readTargets("targets.txt")
> x <- read.maimages(targets, source="agilent",
>     columns=list(R="gMedianSignal", Rb="gBGMedianSignal",
>                  G="gMedianSignal", Gb="gBGMedianSignal"))
> y.bg <- backgroundCorrect(x, method="normexp")
> eset <- y.bg$G
> ## log2 transform before normalization!!!
> eset.l <- round(log2(eset), 4)
> y.bgn.l.2 <- normalizeBetweenArrays(eset.l, method="quantile")
>
> I got a bell-shaped distribution for all probes in a single array when
> the data were log2-transformed before normalization. Whether to log2
> first is an old question, but in my data it doesn't matter: the boxplots
> for a single gene across patients are identical, and the signature genes
> separate the patients very well in both cases.
>
> Best,
> Chunxuan
>
>
> > Date: Sun, 10 Nov 2013 19:45:43 -0500
> > From: jmacdon at uw.edu
> > To: hibergo at outlook.com
> > CC: bioconductor at r-project.org
> > Subject: Re: [BioC] distribution of agilent array data.
> >
> > Hi Chunxuan,
> >
> > On Sunday, November 10, 2013 3:19:44 PM, shao chunxuan wrote:
> > >
> > > Hi everyone,
> > >
> > > I am confused by the histogram of normalized Agilent microarray data.
> > > It is a human single-color array data set, containing around 700
> > > microarrays and 43K probes.
> > >
> > > After normalization, I plotted the expression values of all probes in
> > > a single microarray; one example is attached.
> > >
> > > I expected to see a more or less symmetric distribution; however,
> > > the values seem truncated. At first I thought it might relate
> > > to the offset value, but I have tried different values (16, 1, 0) and
> > > still got a similar distribution.
> >
> > Why would you expect a symmetric distribution? Also, plotting a
> > histogram with such large bin sizes isn't very helpful - I wouldn't be
> > willing to say much about the distribution based on that plot anyway.
> >
> > A more reasonable expectation is something like a convolution of a
> > lognormal and an exponential distribution. In other words, there are
> > likely a large number of genes that aren't expressed, and the
> > distribution of those probes will be symmetrical around some small
> > number. And the distribution of expressed genes is likely to be
> > something like an exponential, with a long right tail. And since you
> > used the normexp background correction, you made the same assumption
> > as well.
> >
> > Best,
> >
> > Jim
> >
> >
> > >
> > > Any explanation or suggestions?
> > >
> > > Here is the code for normalization:
> > > library(limma)
> > > targets <- readTargets("targets.txt")
> > > x <- read.maimages(targets, source="agilent", green.only=TRUE)
> > > y.bg <- backgroundCorrect(x, method="normexp")
> > > y.bgn <- normalizeBetweenArrays(y.bg, method="quantile")
> > > g.ex <- avereps(y.bgn, ID=y.bgn$genes$ProbeName)
> > > da.norm <- g.ex$E
> > >
> > > Here is the R session info:
> > > R version 3.0.2 (2013-09-25)
> > > Platform: x86_64-apple-darwin10.8.0 (64-bit)
> > >
> > > locale:
> > > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> > >
> > > attached base packages:
> > > [1] graphics  grDevices utils     datasets  stats     methods   base
> > >
> > > other attached packages:
> > > [1] ggplot2_0.9.3.1 reshape2_1.2.2  plyr_1.8
> > >
> > > loaded via a namespace (and not attached):
> > >   [1] colorspace_1.2-4   dichromat_2.0-0    digest_0.6.3       grid_3.0.2         gtable_0.1.2       labeling_0.2
> > >   [7] MASS_7.3-29        munsell_0.4.2      proto_0.3-10       RColorBrewer_1.0-5 scales_0.2.3       stringr_0.6.2
> > > Best,
> > >
> > > chunxuan
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at r-project.org
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives:
> > > http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> > --
> > James W. MacDonald, M.S.
> > Biostatistician
> > University of Washington
> > Environmental and Occupational Health Sciences
> > 4225 Roosevelt Way NE, # 100
> > Seattle WA 98105-6099
>

The University of Aberdeen is a charity registered in Scotland, No SC013683.