[BioC] Log transformation and left censoring

Tue Feb 5 01:23:32 CET 2013

Dear Paul,

The transformation that you propose is the same transformation that is 
done by predFC(y) in the edgeR package, or by cpm(y,log=TRUE) in the 
development version of the edgeR package.  The argument prior.count 
controls the moderation amount.
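
As a rough illustration of what adding a prior count means arithmetically, the sketch below evaluates the formula from the quoted message in two algebraically equivalent forms: as a term added to the scaled count, and as an offset added to the raw count in proportion to the library size. It is plain Python rather than edgeR code and does not claim to reproduce the internals of predFC() or cpm(). The function names, the library sizes and the default moderation of 5 are assumptions chosen for illustration.

```python
import math


def moderated_log2(count, lib_size, mean_lib_size, moderation=5.0):
    """log2(count / lib_size + moderation / mean_lib_size), as in the quoted message."""
    return math.log2(count / lib_size + moderation / mean_lib_size)


def offset_log2(count, lib_size, mean_lib_size, moderation=5.0):
    """The same quantity, written as an offset added to the raw count."""
    # The offset grows with the library size, so larger libraries get a larger prior count.
    offset = moderation * lib_size / mean_lib_size
    return math.log2((count + offset) / lib_size)


# Hypothetical library sizes, for illustration only.
lib_sizes = [1_000_000, 4_000_000]
mean_lib = sum(lib_sizes) / len(lib_sizes)

for lib in lib_sizes:
    for count in (0, 3, 50):
        a = moderated_log2(count, lib, mean_lib)
        b = offset_log2(count, lib, mean_lib)
        # The two forms agree up to floating-point rounding.
        assert math.isclose(a, b)
        print(lib, count, round(a, 3))
```

Written the second way, the moderation amount plays the role of a prior count that is scaled to each library before the log is taken.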

This is the same transformation that we recommend and use ourselves for 
heatmaps.  See Section 2.10 of the edgeR User's Guide:

http://bioconductor.org/packages/2.12/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

There is an example of its use on page 58 of the User's Guide.

Belinda Phipson has shown as part of her PhD work that, under some 
assumptions, this transformation comes close to minimizing the mean square 
error when predicting the true log fold changes.

Simply putting these logCPM values into limma will perform comparably to 
voom if the library sizes are not very different, provided that you use 
eBayes(fit,trend=TRUE).  When the library sizes are different, however, 
voom is the clear winner.

There is no censoring.  A major reason for adding an offset (aka 
prior.count) to the counts is to avoid the need to censor, truncate or 
remove observations.  Rather, a monotonic transformation of the counts is 
performed for each library.
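
To make the distinction from censoring concrete, here is a minimal sketch, again in plain Python and using the formula from the quoted message rather than any edgeR internals. Under those assumptions it checks two things: within a library, larger counts always give larger transformed values, so no observations are collapsed together; and a zero count maps to the same value in every library, whatever the library size. The function name, library sizes and moderation amount are hypothetical.

```python
import math


def moderated_log2(count, lib_size, mean_lib_size, moderation=5.0):
    """log2(count / lib_size + moderation / mean_lib_size), as in the quoted message."""
    return math.log2(count / lib_size + moderation / mean_lib_size)


# Hypothetical library sizes, for illustration only.
lib_sizes = [1_000_000, 4_000_000]
mean_lib = sum(lib_sizes) / len(lib_sizes)

# A zero count gives log2(moderation / mean_lib_size), which does not depend on lib_size.
floors = [moderated_log2(0, lib, mean_lib) for lib in lib_sizes]
assert floors[0] == floors[1]

# Within each library the transformation is strictly increasing in the count,
# so distinct counts stay distinct and nothing is truncated or removed.
for lib in lib_sizes:
    values = [moderated_log2(count, lib, mean_lib) for count in range(0, 101)]
    assert all(x < y for x, y in zip(values, values[1:]))
```

The common value for zero counts is what gives the appearance of a shared floor across samples, even though counts above zero are never clipped to it.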

Best wishes
Gordon

On Jan 31, 2013, 8:57 AM, Paul Harrison <Paul.Harrison at monash.edu> wrote:

> Hello,
>
> We have been using voom and limma for some time now, and while we're 
> fairly happy with it, it seems to produce significance levels that are 
> on the conservative side. We also use edgeR to produce more optimistic 
> results, but don't entirely trust the significance levels that it 
> reports. I am looking for something in-between these extremes, and want 
> to run an idea past this list as a sanity check. I would especially 
> value Gordon and Charity's comments if they have time.
>
> The voom log transformation is essentially:
>
>  log2( (count+0.5) / library.size )
>
> It then does some clever things with weights. What I'm considering 
> instead is
>
>  log2( count / library.size + moderation.amount / mean.library.size )
>
> where moderation.amount is much larger than 0.5, say 5. A couple of 
> things here:
>
> - Instead of down-weighting low counts, I'm trying to get rid of the 
> extra variation from low counts by artificially left censoring the data.
>
> - I'm using the mean of the library sizes because I want the left censor 
> to be in the same place for each sample even if the library sizes are 
> different, so that if a gene is entirely switched off in one condition 
> it won't look variable just because there is a different left censor in 
> each sample.
>
> I'm also using this transformation to create heatmaps.
>
> This seems to be working with the data set I am working with: I get more 
> significant results and they seem reasonable by eye. It seems to me that 
> even if this approach isn't ideal it should at least be safe; at worst 
> it will cause limma to reduce the df.prior and produce less significant 
> results. Anything I've missed?
>
> --
> Paul Harrison
>
> Victorian Bioinformatics Consortium / Monash University
