[BioC] limma moderated t-statistics and B-statistics
Gordon Smyth
smyth at wehi.edu.au
Wed Sep 22 06:54:52 CEST 2004
This is to respond to a number of questions about the interpretation of the
moderated t and B-statistics in limma. This will be a section of the Limma
User's Guide in the next release.
Gordon
----------------------------------
Statistics for Differential Expression
A number of summary statistics are computed by the eBayes() function for
each gene and each contrast. The M-value (M) is the log2-fold change, or
sometimes the log2-expression level, for that gene. The A-value (A) is the
average expression level for that gene across all the arrays and
channels. The moderated t-statistic (t) is the ratio of the M-value to its
standard error. This has the same interpretation as an ordinary t-statistic
except that the standard errors have been moderated across genes,
effectively borrowing information from the ensemble of genes to aid with
inference about each individual gene. The ordinary t-statistics are not
usually required or recommended, but they can be recovered by
> tstat.ord <- fit$coef / fit$stdev.unscaled / fit$sigma
after fitting a linear model. The ordinary t-statistic is on
fit$df.residual degrees of freedom while the moderated t-statistic is on
fit$df.residual+fit$df.prior degrees of freedom.
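
As a rough sketch of what "moderated" means here, the following base R code
contrasts the two statistics for a single gene. All numbers are hypothetical.
The shrinkage formula for the posterior variance is the standard one from the
limma hierarchical model and is not spelled out in this post; s2.prior and
s2.post are assumed to correspond to the components of the same names stored
by eBayes() in current limma versions.

```r
# Hypothetical values for one gene (not taken from real data)
coef           <- 1.2   # estimated log2-fold change, as in fit$coef
stdev.unscaled <- 0.5   # as in fit$stdev.unscaled
sigma          <- 0.4   # residual standard deviation, as in fit$sigma
df.residual    <- 4     # residual degrees of freedom for this gene

# Hypothetical prior parameters, of the kind eBayes() estimates from all genes
df.prior <- 3
s2.prior <- 0.09

# Ordinary t-statistic: uses the gene's own variance only
t.ord <- coef / stdev.unscaled / sigma

# Posterior (moderated) variance: weighted average of prior and gene-wise variance
s2.post <- (df.prior * s2.prior + df.residual * sigma^2) / (df.prior + df.residual)

# Moderated t-statistic and its two-sided p-value on the augmented degrees of freedom
t.mod    <- coef / stdev.unscaled / sqrt(s2.post)
df.total <- df.residual + df.prior
p.mod    <- 2 * pt(abs(t.mod), df = df.total, lower.tail = FALSE)
```

In this made-up case the gene's own variance is larger than the prior
variance, so the moderated variance is pulled down towards the prior and the
moderated t is somewhat larger than the ordinary t. With a gene-wise variance
below the prior the shrinkage would go the other way.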
The p-value (p-value) is obtained from the moderated t-statistic, usually
after some form of adjustment for multiple testing. The most popular form
of adjustment is "fdr" which is Benjamini and Hochberg's method to control
the false discovery rate. The meaning of the adjusted p-value is as
follows. If you select all genes with adjusted p-value below a given value,
say 0.05, as differentially expressed, then the expected proportion of false
discoveries in the selected group should be less than that value, in this
case less than 5%.
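
The adjustment itself can be illustrated with base R's p.adjust(), where
"fdr" is accepted as an alias for "BH". The p-values below are invented for
illustration only; this is a sketch of the adjustment, not of how limma
applies it internally.

```r
# Hypothetical raw p-values for five genes
p.raw <- c(0.001, 0.01, 0.02, 0.03, 0.2)

# Benjamini and Hochberg adjustment; method = "fdr" gives the same result
p.adj <- p.adjust(p.raw, method = "BH")

# Genes selected at an expected false discovery rate below 5%
selected <- p.adj < 0.05
print(data.frame(p.raw, p.adj, selected))
```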
The B-statistic (lods or B) is the log-odds that the gene is
differentially expressed. Suppose for example that B=1.5. The odds of
differential expression are exp(1.5)=4.48, i.e., about four and a half to
one. The probability that the gene is differentially expressed is
4.48/(1+4.48)=0.82, i.e., the probability is about 82% that this gene is
differentially expressed. A B-statistic of zero corresponds to a 50-50
chance that the gene is differentially expressed. The B-statistic is
automatically adjusted for multiple testing by assuming that 1% of the
genes, or some other percentage specified by the user, are expected to be
differentially expressed. If there are no missing values in your data, then
the moderated t and B statistics will rank the genes in exactly the same
order. Even if you do have spot weights or missing data, the p-values and
B-statistics will usually provide a very similar ranking of the genes.
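
The B=1.5 arithmetic above can be reproduced in a few lines of base R. The
variable names are illustrative only; plogis() is the base R logistic
function and performs the same conversion in one step.

```r
B <- 1.5                   # example B-statistic (log-odds)

odds <- exp(B)             # odds of differential expression
prob <- odds / (1 + odds)  # corresponding probability

# Same conversion using the logistic function
prob.logis <- plogis(B)

round(odds, 2)  # 4.48
round(prob, 2)  # 0.82
plogis(0)       # 0.5, i.e. B = 0 is a 50-50 chance
```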
Please keep in mind that the moderated t-statistic p-values and the
B-statistic probabilities depend on various sorts of mathematical
assumptions which are never exactly true for microarray data. The
B-statistics also depend on a prior guess for the proportion of
differentially expressed genes. Therefore they are intended to be taken as
a guide rather than as a strict measure of the probability of differential
expression. Of the three statistics, the moderated-t, the associated
p-value and the B-statistics, we usually base our gene selections on the
p-value. All three measures are closely related, but the moderated-t and
its p-value do not require a prior guess for the number of differentially
expressed genes.
The above mentioned statistics are computed for every contrast for each
gene. The eBayes() function computes one more useful statistic. The
moderated F-statistic (F) combines the t-statistics for all the contrasts
for each gene into an overall test of significance for that gene. The
moderated F-statistic tests whether any of the contrasts are non-zero for
that gene, i.e., whether that gene is differentially expressed on any
contrast. The moderated-F has numerator degrees of freedom equal to the
number of contrasts and denominator degrees of freedom the same as the
moderated-t. Its p-value is stored as fit$F.p.value. It is similar to the
ordinary F-statistic from analysis of variance except that the denominator
mean squares are moderated across genes.
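
As a rough illustration of the degrees of freedom just described, the p-value
for a moderated F-statistic can be sketched with base R's pf(). All values
below are hypothetical and the variable names are illustrative; in practice
the result is simply read from fit$F.p.value.

```r
# Hypothetical values for one gene
F.stat      <- 4.2   # moderated F-statistic
n.contrasts <- 3     # number of contrasts in the fit
df.residual <- 4     # residual degrees of freedom
df.prior    <- 3     # prior degrees of freedom, of the kind estimated by eBayes()

df1 <- n.contrasts             # numerator degrees of freedom
df2 <- df.residual + df.prior  # denominator df, same as for the moderated t

# Upper-tail probability of the F distribution
F.p.value <- pf(F.stat, df1, df2, lower.tail = FALSE)
```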
In a complex experiment with many contrasts, it may be desirable to select
genes firstly on the basis of their moderated F-statistics, and
subsequently to decide which of the individual contrasts are significant
for those genes. This cuts down on the number of tests which need to be
conducted and therefore on the amount of adjustment for multiple testing.
The functions classifyTestsF() and decideTests() are provided for this purpose.