[BioC] limma moderated t-statistics and B-statistics

Wed Sep 22 06:54:52 CEST 2004

This is to respond to a number of questions about the interpretation of the 
moderated t and B-statistics in limma. This will be a section of the Limma 
User's Guide in the next release.

Gordon
----------------------------------

Statistics for Differential Expression

A number of summary statistics are computed by the eBayes() function for 
each gene and each contrast. The M-value (M) is the log2-fold change, or 
sometimes the log2-expression level, for that gene. The A-value (A) is the 
average expression level for that gene across all the arrays and 
channels. The moderated t-statistic (t) is the ratio of the M-value to its 
standard error. This has the same interpretation as an ordinary t-statistic 
except that the standard errors have been moderated across genes, 
effectively borrowing information from the ensemble of genes to aid with 
inference about each individual gene. The ordinary t-statistics are not 
usually required or recommended, but they can be recovered by

 > tstat.ord <- fit$coef / fit$stdev.unscaled / fit$sigma

after fitting a linear model. The ordinary t-statistic is on 
fit$df.residual degrees of freedom while the moderated t-statistic is on 
fit$df.residual+fit$df.prior degrees of freedom.
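
As a rough illustration of how the two statistics differ, here is a small 
sketch in base R for a single gene. All the numbers are made up, and the 
variable names s2.prior and s2.post are chosen here only for readability. The 
sketch assumes the usual empirical Bayes shrinkage, in which the moderated 
variance is a weighted average of the prior variance and the gene-wise 
variance, with the prior and residual degrees of freedom as weights.

```r
# All numbers below are hypothetical, chosen only to illustrate the arithmetic.
coef           <- 1.2   # estimated log2-fold change (M-value)
stdev.unscaled <- 0.5   # unscaled standard deviation of the coefficient
sigma          <- 0.8   # gene-wise residual standard deviation
df.residual    <- 4     # residual degrees of freedom
s2.prior       <- 0.3   # prior variance, estimated from the ensemble of genes
df.prior       <- 3     # prior degrees of freedom

# Ordinary t-statistic, as in the formula above
t.ord <- coef / stdev.unscaled / sigma

# Moderated variance: weighted average of prior and gene-wise variances
s2.post <- (df.prior * s2.prior + df.residual * sigma^2) /
           (df.prior + df.residual)

# Moderated t-statistic and its degrees of freedom
t.mod    <- coef / stdev.unscaled / sqrt(s2.post)
df.total <- df.residual + df.prior

# Two-sided p-values on the respective degrees of freedom
p.ord <- 2 * pt(-abs(t.ord), df = df.residual)
p.mod <- 2 * pt(-abs(t.mod), df = df.total)
```

In this made-up case the gene-wise variance is larger than the prior 
variance, so the moderated variance is smaller and the moderated t is larger 
than the ordinary t. For a gene with an unusually small gene-wise variance 
the shrinkage would work in the opposite direction.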

The p-value (p-value) is obtained from the moderated t-statistic, usually 
after some form of adjustment for multiple testing. The most popular form 
of adjustment is "fdr" which is Benjamini and Hochberg's method to control 
the false discovery rate. The meaning of the adjusted p-value is as 
follows. If you select all genes with adjusted p-value below a given value, 
say 0.05, as differentially expressed, then the expected proportion of false 
discoveries in the selected group should be less than that value, in this 
case less than 5%.
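
The adjustment itself can be tried out with the p.adjust() function in base 
R, which accepts "fdr" as another name for "BH". The six p-values below are 
invented for illustration and do not come from any real experiment.

```r
# Hypothetical unadjusted p-values for six genes
p <- c(0.001, 0.008, 0.039, 0.041, 0.2, 0.6)

# Benjamini and Hochberg adjustment; "fdr" and "BH" give the same result
p.fdr <- p.adjust(p, method = "fdr")
p.bh  <- p.adjust(p, method = "BH")

# Genes that would be selected at a false discovery rate of 5%
selected <- which(p.fdr < 0.05)
```

Note that four of the unadjusted p-values are below 0.05 but only the first 
two genes remain below 0.05 after adjustment.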

The B-statistic (lods or B) is the log-odds that the gene is 
differentially expressed. Suppose for example that B=1.5. The odds of 
differential expression are exp(1.5)=4.48, i.e., about four and a half to 
one. The probability that the gene is differentially expressed is 
4.48/(1+4.48)=0.82, i.e., the probability is about 82% that this gene is 
differentially expressed. A B-statistic of zero corresponds to a 50-50 
chance that the gene is differentially expressed. The B-statistic is 
automatically adjusted for multiple testing by assuming that 1% of the 
genes, or some other percentage specified by the user, are expected to be 
differentially expressed. If there are no missing values in your data, then 
the moderated t and B statistics will rank the genes in exactly the same 
order. Even if you do have spot weights or missing data, the p-values and 
B-statistics will usually provide a very similar ranking of the genes.
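
The conversion from a B-statistic to odds and to a probability is simple 
enough to check in base R. The value B=1.5 is the one used in the example 
above; the variable names are chosen here for illustration only.

```r
B <- 1.5                      # log-odds of differential expression

odds <- exp(B)                # odds of differential expression
prob <- odds / (1 + odds)     # corresponding probability, about 0.82

# plogis() is the inverse logit and performs the same conversion in one step
prob.direct <- plogis(B)

# A B-statistic of zero corresponds to a 50-50 chance
prob.zero <- plogis(0)
```

The same conversion applied to a whole column of B-statistics gives a 
probability for every gene, subject to the caveats about the prior 
proportion of differentially expressed genes.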

Please keep in mind that the moderated t-statistic p-values and the 
B-statistic probabilities depend on various sorts of mathematical 
assumptions which are never exactly true for microarray data. The 
B-statistics also depend on a prior guess for the proportion of 
differentially expressed genes. Therefore they are intended to be taken as 
a guide rather than as a strict measure of the probability of differential 
expression. Of the three statistics, the moderated-t, the associated 
p-value and the B-statistics, we usually base our gene selections on the 
p-value. All three measures are closely related, but the moderated-t and 
its p-value do not require a prior guess for the number of differentially 
expressed genes.

The above-mentioned statistics are computed for every contrast for each 
gene. The eBayes() function computes one more useful statistic. The 
moderated F-statistic (F) combines the t-statistics for all the contrasts 
for each gene into an overall test of significance for that gene. The 
moderated F-statistic tests whether any of the contrasts are non-zero for 
that gene, i.e., whether that gene is differentially expressed on any 
contrast. The moderated-F has numerator degrees of freedom equal to the 
number of contrasts and denominator degrees of freedom the same as the 
moderated-t. Its p-value is stored as fit$F.p.value. It is similar to the 
ordinary F-statistic from analysis of variance except that the denominator 
mean squares are moderated across genes.
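
The relationship between the moderated F and the moderated t can be checked 
numerically in base R. The statistics and degrees of freedom below are 
hypothetical. With a single contrast the F-statistic is simply the square of 
the t-statistic, and the two tests give the same p-value.

```r
# Hypothetical moderated t-statistic and total degrees of freedom
t.mod    <- 2.5
df.total <- 7      # plays the role of fit$df.residual + fit$df.prior

# Two-sided p-value from the moderated t
p.t <- 2 * pt(-abs(t.mod), df = df.total)

# With one contrast, F = t^2 on 1 and df.total degrees of freedom
p.F1 <- pf(t.mod^2, df1 = 1, df2 = df.total, lower.tail = FALSE)

# Hypothetical moderated F-statistic combining three contrasts
F.mod       <- 5
n.contrasts <- 3
p.F3 <- pf(F.mod, df1 = n.contrasts, df2 = df.total, lower.tail = FALSE)
```

Here p.t and p.F1 agree, while p.F3 is the overall p-value for the three 
contrasts taken together.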

In a complex experiment with many contrasts, it may be desirable to select 
genes firstly on the basis of their moderated F-statistics, and 
subsequently to decide which of the individual contrasts are significant 
for those genes. This cuts down on the number of tests which need to be 
conducted and therefore on the amount of adjustment for multiple testing. 
The functions classifyTestsF() and decideTests() are provided for this purpose.