[BioC] Log transformation and left censoring

Tue Feb 5 07:01:19 CET 2013

On Tue, Feb 5, 2013 at 11:23 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
> Dear Paul,
>
> The transformation that you propose is the same transformation that is done
> by predFC(y) in the edgeR package, or by cpm(y,log=TRUE) in the
> developmental version of the edgeR package.  The argument prior.count
> controls the moderation amount.
>
> This is the same transformation that we recommend and use ourselves for
> heatmaps.  See Section 2.10 of the edgeR User's Guide:
>
> http://bioconductor.org/packages/2.12/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
>
> There is an example of its use on page 58 of the User's Guide.
>
> Belinda Phipson has shown as part of her PhD work that, under some
> assumptions, this transformation comes close to minimizing the mean square
> error when predicting the true log fold changes.
>
> Simply putting these logCPM values into limma will perform comparably to
> voom if the library sizes are not very different, provided that you use
> eBayes(fit,trend=TRUE).  When the library sizes are different, however, voom
> is the clear winner.
>

It's reassuring to see better statisticians than me thinking along
similar lines, including using the mean library size. A difference is
that the default prior.count is much smaller than I have been using,
0.25. But then in the example in section 2.10, you're a using
prior.count of 2.

The ideal prior.count for estimating fold changes might be different
than the ideal for visualization and clustering, and the ideal for
detecting differential expression might be different from the first or
both of these. With my large prior.count I'm detecting extra
differentially expressed genes, and in particular these are genes with
low average expression. Also, an artifact in my heatmaps, which seems
to be due to the different sample sizes, is removed. It might be that
my data is weird, but these improvements seem to be to do with having
a spread of library sizes.

> There is no censoring.  A major reason for adding an offset (aka
> prior.count) to the counts is to avoid the need to censor, truncate or
> remove observations.  Rather a mononotic transformation of the counts is
> performed for each library.
>

Apologies for being unclear. Yes, from the perspective of
untransformed counts, there is no censoring. I was viewing the problem
from the perspective of log transformed expression values, which is an
odd viewpoint. From this perspective, there's a (soft) line left of
which it's impossible estimate the log expression, just because we're
using count data (microarrays have a similar line, but for different
reasons).

Further, where the number of reads is different in different samples,
they have different (soft) cutoff lines. Statistical methods that work
with count data directly, such as edgeR, shouldn't be affected by
this, but when feeding log transformed values into limma it is a
potential problem. I definitely wouldn't want to remove low counts,
because I'm then likely to miss those genes with such a large fold
change that in one group the reads are almost entirely absent, and
these are likely to be exactly the genes I am looking for. So I want
to (softly) truncate low counts at a consistent point before log
transforming them.

One downside: even if this works for straightforward differential
expression, it will tend to see interactions that don't actually
exist. Voom's weighting system might mitigate this, and it would be an
interesting demonstration of voom's power.

Hmm. What I actually want is a test that works on counts directly,
with empirical priors on coefficients and the dispersion. Um. Assuming
we have model0 and model1, and some reasonable priors on the
parameters in these models. Say for the count data for a gene, x, we
have a way to compute P(x|model0) and P(x|model1). Use P(x|model1) as
a test statistic, the significance is the likelihood that P(x2|model1)
>= P(x|model1) for x2 sampled from model0. We can generate a few
million samples from model0 to give a distribution to test against,
only need to do this once, so it's computationally feasible.

For actually computing P(x|model): P(x | coefficients, dispersion) is
straightforward, and assume we have priors P(coefficients |
dispersion), P(dispersion). Would need to numerically integrate over
dispersion, I think. For a range of dispersion values find the MAP for
the coefficients then use an exp(2nd order Taylor expansion)
approximation around that to integrate P(x | coefficients, dispersion)
P(coefficients | dispersion) down to P(x | dispersion). Then sum P(x |
dispersion) P(dispersion) to get P(x).

Tweak emprical priors and repeat a few times. It wouldn't exactly be
fast, but it wouldn't be unworkably slow.

-- 
Paul Harrison

Victorian Bioinformatics Consortium / Monash University