[BioC] A question regarding the mean of M-values.

Fri Apr 29 04:20:11 CEST 2005

I know that "fold change" is an intuitive measure which non-mathematical 
users like to relate things back to. Unfortunately, taking arithmetic means 
of "fold changes" does not give sensible results. Here is a simple example 
to show why:

Suppose you are comparing a cell line in stimulated and unstimulated 
conditions, and you have two biological replicates. Suppose the first 
replicate gives you 10-fold up regulation in the stimulated condition, and 
the second replicate is 10-fold down regulated. The only sensible 
conclusion here is that there is no systematic difference between the 
stimulated and unstimulated conditions, but that there is a lot of 
variability between the replicates. This is exactly what the log-ratio 
analysis would tell you.
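
As a small numerical sketch of that claim (plain Python, standard library only; the variable names are mine and purely illustrative), the two replicates expressed as log2 ratios average to essentially zero, while the spread between them is large:

```python
import math
import statistics

# Fold changes (stimulated / unstimulated) for the two replicates in the example.
fold_changes = [10.0, 1.0 / 10.0]

# M-values: log2 ratios, so 10-fold up and 10-fold down are symmetric about 0.
m_values = [math.log2(fc) for fc in fold_changes]

mean_m = statistics.mean(m_values)  # average log-ratio across replicates
sd_m = statistics.stdev(m_values)   # sample standard deviation between replicates

print(f"M-values: {m_values}")
print(f"mean M = {mean_m:.3f}, sd = {sd_m:.3f}")
```

The mean log-ratio is zero up to floating-point rounding (no systematic change), and the standard deviation of about 4.7 on the log2 scale reflects the variability between the replicates.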

On the other hand, if you average the fold changes, you get nonsense 
results. The two fold changes are:

10 and 1/10

so the "average" fold change is (10 + 1/10)/2 = 5.05, a bit over 5. So you 
conclude that "on average" the stimulation produces 5-fold up regulation. 
This is nonsense.

Worse still, if you compute the fold changes the other way around, you make 
the opposite conclusion. A perfectly equivalent way to state the results 
would be to say that the first replicate is 10 fold down in the 
unstimulated condition and the second is 10 fold up. So the two fold changes 
are:

1/10 and 10

so the "average" fold change is again a bit over 5. But now you conclude 
that the *unstimulated condition* gives a 5-fold change over the 
stimulated condition. This is the opposite of what you concluded when you 
expressed the fold changes the other way around.
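
The contradiction can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `arithmetic_mean_fold_change` is hypothetical and stands for the naive approach being criticised, not for anything limma does:

```python
from statistics import mean


def arithmetic_mean_fold_change(ratios):
    """Naive arithmetic mean of raw fold changes (the approach criticised here)."""
    return mean(ratios)


# Replicate ratios expressed as stimulated / unstimulated.
stim_over_unstim = [10.0, 0.1]
# The same data expressed the other way around: unstimulated / stimulated.
unstim_over_stim = [1.0 / r for r in stim_over_unstim]

up_in_stim = arithmetic_mean_fold_change(stim_over_unstim)
up_in_unstim = arithmetic_mean_fold_change(unstim_over_stim)

print(f"stimulated vs unstimulated: {up_in_stim:.2f}-fold")
print(f"unstimulated vs stimulated: {up_in_unstim:.2f}-fold")
```

Both directions come out at about 5.05-fold "up". A coherent summary would give reciprocal answers for the two directions (their product would be 1); here the product is roughly 25.5.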

It is necessary to express the fold changes on a log-ratio scale, so that 
multiplicative changes become additive, before it makes any sense to take 
arithmetic averages. There are a lot of good statisticians in Stockholm -- 
why not talk to one of them about this?
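
To make the log-scale point concrete, here is a minimal Python sketch using the numbers from the original question quoted below (M = 8 and M = 1). It assumes the M-values are log2 ratios; the variable names are illustrative only. Averaging the M-values and back-transforming gives the geometric mean of the fold changes, not their arithmetic mean:

```python
import math

# M-values (log2 ratios) taken from the original question quoted below.
m_values = [8.0, 1.0]

# Arithmetic mean on the log scale.
mean_m = sum(m_values) / len(m_values)

# The corresponding raw fold changes and their two kinds of mean.
fold_changes = [2.0 ** m for m in m_values]
geometric_mean = math.prod(fold_changes) ** (1.0 / len(fold_changes))
arithmetic_mean = sum(fold_changes) / len(fold_changes)

print(f"mean M = {mean_m}")
print(f"2**(mean M) = {2.0 ** mean_m:.1f}")
print(f"geometric mean of fold changes = {geometric_mean:.1f}")
print(f"arithmetic mean of fold changes = {arithmetic_mean:.1f}")
```

The back-transformed mean M-value and the geometric mean agree (about 22.6), whereas the arithmetic mean of the raw fold changes (129) is dominated by the larger value.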

Gordon

>Date: Thu, 28 Apr 2005 09:13:16 +0200
>From: "Johan Lindberg" <johanl at biotech.kth.se>
>Subject: RE: [BioC] A question regarding the mean of M-values.
>To: <bioconductor at stat.math.ethz.ch>
>
>
>Hi all.
>I have encountered the same problem. In LIMMA it is possible to handle
>two levels of replicates. You can use duplicateCorrelation for one level
>(technical replicates or duplicate spots) and use the rest as biological
>replicates to fit your model. But say that I have another level of
>replicates. I have replicate spots, technical replicates and biological
>replicates. I guess the right thing to do is to average over the
>replicate spots and use duplicate correlation for the technical
>replicates.
>Here I started wondering since limma, when calculating a contrast
>between two samples uses the arithmetic mean on the M-values which is
>the same as taking the geometric mean on the fold-changes and then
>taking the logarithm of that value, or ?!?
>
>Recall laws of logarithms:
>log(xy) = log(x) + log(y)
>log(x^n) = n*log(x)
>
>This means that if I take
>
>(log(M1)+log(M2)+log(M3))/3 this is the same as taking
>log((M1*M2*M3)^(1/3)) which is the same as taking the geometric mean on
>the fold changes and then taking the logarithm of that value.
>
>I wonder, can one motivate using geometric mean on expression data
>instead of arithmetic? See
>http://www.math.toronto.edu/mathnet/questionCorner/geomean.html
>for a nice tip of when to use what mean...
>
>To me it seems like one should, if you want to take a mean of M-values
>in an expression experiment, remove the logarithm, calculate the average
>fold change and then take the logarithm of choice of that value.
>
>Comments appreciated to a guy with limited math-skills being out on deep
>water....
>
>// Johan L
>
>
>
>-----Original Message-----
>From: bioconductor-bounces at stat.math.ethz.ch
>[mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of marcus
>Sent: Wednesday, April 27, 2005 5:02 PM
>To: bioconductor at stat.math.ethz.ch
>Subject: [BioC] A question regarding the mean of M-values.
>
>
>Hello all users.
>
>I have a question regarding the mean calculations of the M-values in
>LIMMA.
>
>I guess that the fit$coeff is the mean of the M-values used for the
>linear model. The fit$coeff has the mean value of the data derived from
>a specific RNA source (as defined in the design matrix), and the value
>in fit$coeff[1] is the same as mean(MS[1,1:2]) (if, for example, I had
>Sample 1 on 2 arrays in my matrix containing the data).
>
>So...if you take the mean of two values (in the log 2 scale), for
>example M = 8 and M = 1, the mean (and hence the fit$coef ?) will be
>4,5.
>
>If you want to look at the foldchange I guess that 2^fit$coeff is
>correctly calculated, so for the example it will be 2^4,5 = 22,6 times
>upregulated.
>
>But if you look at the values independently, M=8 will give 2^8 = 256
>times, and M=1 will give 2^1 = 2 times upregulation. The mean of these
>values is (256 + 2) / 2 = 129 times.
>
>I know that the question is a bit naive, but how should one do when you
>take the mean of logarithms since the numbers are not related to each
>other as normal numbers are. E.g. the number 8 is not twice the size of
>4 on a logarithmic scale, it is 10000 times more (on a log10 scale).
>
>So....how should one do, when I want to take the average of log values?
>Shouldn't I calculate the ratios back (not in log2 scale) and calculate
>the mean, and transform the data back, if I would like to have an
>average M value?
>
>Regards
>
>Marcus