# [BioC] A question regarding the mean of M-values.

Johan Lindberg johanl at biotech.kth.se
Thu Apr 28 09:13:16 CEST 2005

```Hi all.
I have encountered the same problem. In LIMMA it is possible to handle
two levels of replicates. You can use duplicateCorrelation for one level
(technical replicates or duplicate spots) and use the rest as biological
replicates to fit your model. But say that I have another level of
replicates. I have replicate spots, technical replicates and biological
replicates. I guess the right thing to do is to average over the
replicate spots and use duplicate correlation for the technical
replicates.
Here I started wondering since limma, when calculating a contrast
between two samples uses the arithmetic mean on the M-values which is
the same as taking the geometric mean on the fold-changes and then
taking the logarithm of that value, or ?!?

Recall laws of logarithms:
log(xy) = log(x) + log(y)
log(x^n) = n*log(x)

This means that if I take

(log(M1)+log(M2)+log(M3))/3 this is the same as taking
log((M1*M2*M3)^(1/3)) which is the same as taking the geometric mean on
the fold changes and then taking the logarithm of that value.

I wonder, can one motivate using geometric mean on expression data
http://www.math.toronto.edu/mathnet/questionCorner/geomean.html
for a nice tip of when to use what mean...

For me is seem like one should, if you want to take a mean of M-values
in an expression experiment, remove the logarithm, calculate the average
fold change and them use the logarithm of desire on that value.

Comments appreciated to a guy with limited math-skills being out on deep
water....

// Johan L

-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch
[mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of marcus
Sent: Wednesday, April 27, 2005 5:02 PM
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] A question regarding the mean of M-values.

Hello all users.

I have a question regarding the mean calculations of the M-values in
LIMMA.

I guess that the fit\$coeff is the mean of the M-values used for the
linear
model. The fit\$coeff has the mean value of the data derived from a
specific
RNA source (as defined in the design matrix), and the value in
fit\$coeff[1]
is the same as mean(MS[1,1:2]) (if I for example had Sample 1 on 2
arrays in
my matrix containing the data.

So...if you take the mean of two values (in the log 2 scale), for
example
M = 8 and M = 1, the mean (and hence the fit\$coef ?) will be 4,5.

If you want to look at the foldchange I guess that 2^fit\$coeff is
correctly
calculated, so for the example it will be 2^4,5 = 22,6 times
upregulated.

But if you look at the values independently, M=8 will give 2^8 = 256
times,
and 2^1 = 2 times upregulation. The mean of these values are (256 + 2) /
2 =
129 times.

I know that the question is a bit naive, but how should one do when you
take
the mean of logarithms since the numbers are not related to each other
as
normal numbers are. E.g. the number 8 is not twice the size of 4 on a
logarithmic scale, it is 10000 times more (on a log10 scale).

So....how should one do, when I want to take the average of log values?
Shouldn't I calculate the ratios back (not in log2 scale) and calculate
the
mean, and transform the data back, If I would like to have an average M
value?

Regards

Marcus

Marcus Gry Björklund

Royal Institute of Technology
AlbaNova University Center
Department of Molecular Biotechnology
106 91 Stockholm, Sweden

Phone (office): +46 8 553 783 44
Fax: + 46 8 553 784 81
Visiting address: Roslagstullsbacken 21, Floor 3
Web: http://www.biotech.kth.se/molbio/microarray/index.html
-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch
[mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Marcus Davy
Sent: Wednesday, April 27, 2005 15:41
To: ramasamy at cancer.org.uk; naomi at stat.psu.edu; saurin_jani at yahoo.com
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] LIMMA : design (1, 2, 3, 3 ) , I gotEXCITINGresults,
what could be the logic, since i h

Im another in agreement that you probably need to increase your
biological
replication of microarray slides. Any microarray facility has a certain
experimental sensitivity. If the amount of information you have is low
then
the actual distribution of the observed proportion of false discoveries
under simulation can become highly variable as pi_0, the proportion of
truely equilalently expressing genes tends to 1 (depending on your
cutoff
alpha level), remembering that FDR control is an expectation for any
realization of an experiment. So your list of differentially expressing
genes could contain many (or few) false discoveries but in the long run
should maintain control at ~pi_0*alpha under nonadaptive FDR control.

Saurin, you mention that you have a large list of differentially
expressing
genes using a nonadaptive FDR cutoff of alpha=0.01. As an alternative
approach you could analyze some of the chip comparisons of interest
using
EBarrays (after checking distribution assumptions), which models the
entire
exprSet using hierarchal Gamma Gamma Bernoulli, or Log Normal Normal
Bernoulli models, modelling the null and alternate distributions as a
mixture with unknown mixing proportion p. The advantage of EBarrays is
that
it can analyse a single cDNA two channel array or two affymetrix single
channel arrays, providing reasonable estimates of pi_0. The same problem
applies though with a lack of sensitivity without biological replication
(buts it better than fold change at least). You could see *if* EBarrays
predicts a value of pi_0 which is moderately less than 1 for comparisons
of
interest.

Marcus

>>> Naomi Altman <naomi at stat.psu.edu> 27/04/2005 4:55:11 a.m. >>>
Significance should be based on biological replication.  If the 2 chips
for
group 3 are technical replicates, then the variance estimate for the
test
is probably too small.

In theory, statistical tests need only 2 replicate in a single
condition,
as the null distribution accounts for the number of replicates.
However,
for this theory to hold, the normality of the samples must be pretty
good.  When the data are exactly normally distributed (and the
assumptions
for limma for the distribution of variance hold) then the FDR values
should
be pretty good, but the FNR will be poor (as you have no power).

However, I don't think anyone believes that microarray data are normally

distributed.  So, I would not really trust these results, even if you
have
a biological replicate.  Of course the 2-fold rule is even worse, so
really
you should do more biological replication.

--Naomi

At 09:51 PM 4/26/2005, Saurin Jani wrote:
>
>Yes, you are right. I have 4 samples :
>
>Group1 = Growth Effect for Day 1 : 1 Affy GeneChip.
>Group2 = Growth Effect for Day 2 : 1 Affy GeneChip.
>Group3 = Growth Effect for Day 3 : 2 Affy GeneChips.
>
>so, my design matrix is:
>design <- model.matrix(~ -1+factor(c(1,2,3,3)));
>
>LIMMA did not give any error or waring even it has 1
>sample per group...! ( I thought similar thing,  since
>it needs technical replicates per group to make a
>decision). The results are very interesting. I have
>many genes for 0.01 FDR, which is very good.
>
>Somehow,I don't understand the logic. Do you think is
>this a valid design? Or You think I should go by Fold
>Change Logic. Please, let me know.
>
>Thank you very much,
>Saurin
>
>
>
>
>
>--- Adaikalavan Ramasamy <ramasamy at cancer.org.uk>
>wrote:
> > PLEASE correct me if I am wrong.
> >
> > You have a total of 4 samples that could be
> > classified into one of 3
> > groups ? How do you plan on distinguishing
> > biological from technical
> > variation ? Shouldn't limma come with some sort of
> > warning or error if
> > there are only one sample per group ?
> >
> >
> >
> >
> > On Tue, 2005-04-26 at 10:01 -0700, Saurin Jani
> > wrote:
> > > Hi BioC,
> > >
> > > I have 3 groups but I have only 2 replicates for
> > last
> > > group. so, group 1 and 2 has only one Affy CEL
> > file. I
> > > Did..LIMMA as below and I got some Exciting
> > results:
> > >
> > > #----------------------------------
> > > design <- model.matrix(~ -1+factor(c(1,2,3,3)));
> > > colnames(design) <-  c("g1","g2","g3");
> > > fit <- lmFit(myRMA,design);
> > >
> > > contrast.matrix <-
> > > makeContrasts(g1-g2,g1-g3,g2-g3,levels = design);
> > >
> > > fit2 <- contrasts.fit(fit,contrast.matrix);
> > > fit2 <- eBayes(fit2);
> > >
> > > results <-
> > >
> > > myGenes <- geneNames(myRMA);
> > > i <- apply(results,c(1,2),all);
> > >
> > > a <- i[,1];
> > > b <- i[,2];
> > > c <- i[,3];
> > > tempgenes1 <- myGenes[a];
> > > tempgenes2 <- myGenes[b];
> > > tempgenes3 <- myGenes[c];
> > >
> > > tempall <- c(tempgenes1,tempgenes2,tempgenes3);
> > > myDEGenes <- tempall;
> > >
> > > esetSub2X <- MatrixRMA[myDEGenes,];
> > > esetSub2 <- new("exprSet",exprs = esetSub2X);
> > > pData(esetSub2) <- pData(myRMA);
> > > heatmap(esetSub2X);
> > > #----------------------------------
> > >
> > > I got EXCITING results, what could be the
> > logic,since
> > > i have 2 replicates for 3rd group only ?
> > >
> > > Could anyone point me out ?
> > >
> > > I highly appreciate your help , Thank you in
> > >
> > > Thank you,
> > > Saurin
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at stat.math.ethz.ch
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > >
> >
> >
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at stat.math.ethz.ch
>https://stat.ethz.ch/mailman/listinfo/bioconductor

Naomi S. Altman                                814-865-3791 (voice)
Associate Professor
Bioinformatics Consulting Center
Dept. of Statistics                              814-863-7114 (fax)
Penn State University                         814-865-1348 (Statistics)
University Park, PA 16802-2111

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor

______________________________________________________

The contents of this e-mail are privileged and/or\ confident...{{dropped}}

```