[BioC] Log transformation and left censoring
Gordon K Smyth
smyth at wehi.EDU.AU
Tue Feb 5 01:23:32 CET 2013
Dear Paul,
The transformation that you propose is the same transformation that is
done by predFC(y) in the edgeR package, or by cpm(y,log=TRUE) in the
development version of the edgeR package. The argument prior.count
controls the moderation amount.
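
To see why a prior count can play the role of the moderation amount, note
that the proposed formula can be rewritten as a count plus an offset
proportional to the library size, all divided by the library size. The
Python sketch below only checks that identity numerically. The function
names and the example numbers are made up for illustration, and this is not
the edgeR code itself, which may differ in detail (for example in how the
library sizes are adjusted and in the per-million scaling).

```python
import math

def log_offset_form(count, lib_size, mean_lib_size, moderation):
    """Formula as proposed: log2(count/lib.size + moderation/mean.lib.size)."""
    return math.log2(count / lib_size + moderation / mean_lib_size)

def log_prior_count_form(count, lib_size, mean_lib_size, moderation):
    """Same quantity, written with a prior count scaled by library size."""
    scaled_prior = moderation * lib_size / mean_lib_size
    return math.log2((count + scaled_prior) / lib_size)

# Hypothetical library sizes and counts, chosen only for illustration.
lib_sizes = [2_000_000, 5_000_000, 11_000_000]
mean_lib_size = sum(lib_sizes) / len(lib_sizes)
counts = [0, 3, 40]
moderation = 5

for count in counts:
    for lib_size in lib_sizes:
        a = log_offset_form(count, lib_size, mean_lib_size, moderation)
        b = log_prior_count_form(count, lib_size, mean_lib_size, moderation)
        print(count, lib_size, round(a, 6), round(b, 6))

# A zero count maps to the same value in every library,
# because the offset depends only on the mean library size.
zero_values = [log_offset_form(0, s, mean_lib_size, moderation) for s in lib_sizes]
```

The two forms agree for every count and library size, and zero_values holds
one repeated value, so a gene with zero counts everywhere shows no spurious
variation between libraries of different sizes.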
This is the same transformation that we recommend and use ourselves for
heatmaps. See Section 2.10 of the edgeR User's Guide:
http://bioconductor.org/packages/2.12/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
There is an example of its use on page 58 of the User's Guide.
Belinda Phipson has shown as part of her PhD work that, under some
assumptions, this transformation comes close to minimizing the mean square
error when predicting the true log fold changes.
Simply putting these logCPM values into limma will perform comparably to
voom if the library sizes are not very different, provided that you use
eBayes(fit,trend=TRUE). When the library sizes are different, however,
voom is the clear winner.
There is no censoring. A major reason for adding an offset (aka
prior.count) to the counts is to avoid the need to censor, truncate or
remove observations. Rather, a monotonic transformation of the counts is
performed for each library.
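
The distinction can be illustrated with a small Python sketch. It compares
the offset transformation with a hypothetical left-censored one, in which
any count below a threshold is replaced by the threshold. The censored
version, the threshold of 5 and the library size are assumptions made up
for the comparison; they do not come from edgeR or voom.

```python
import math

# Hypothetical values, chosen only for illustration.
LIB_SIZE = 1_000_000
MEAN_LIB_SIZE = 1_000_000
MODERATION = 5

def offset_log(count):
    """Offset transformation: add a moderation term before taking logs."""
    return math.log2(count / LIB_SIZE + MODERATION / MEAN_LIB_SIZE)

def censored_log(count, threshold=5):
    """Left censoring (for contrast): counts below threshold are raised to it."""
    return math.log2(max(count, threshold) / LIB_SIZE)

counts = [0, 1, 2, 5, 20, 100]
offset_values = [offset_log(c) for c in counts]
censored_values = [censored_log(c) for c in counts]

# Monotonic: larger counts always give strictly larger transformed values.
offset_is_strictly_increasing = all(
    a < b for a, b in zip(offset_values, offset_values[1:])
)

print("offset strictly increasing:", offset_is_strictly_increasing)
print("distinct offset values:", len(set(offset_values)))
print("distinct censored values:", len(set(censored_values)))
```

Every distinct count keeps a distinct value under the offset version, so no
information is discarded, whereas the censored version maps the counts 0,
1, 2 and 5 to a single value.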
Best wishes
Gordon
On Jan 31, 2013, 8:57 AM, Paul Harrison <Paul.Harrison at monash.edu> wrote:
> Hello,
>
> We have been using voom and limma for some time now, and while we're
> fairly happy with it, it seems to produce significance levels that are
> on the conservative side. We also use edgeR to produce more optimistic
> results, but don't entirely trust the significance levels that it
> reports. I am looking for something in-between these extremes, and want
> to run an idea past this list as a sanity check. I would especially
> value Gordon and Charity's comments if they have time.
>
> The voom log transformation is essentially:
>
> log2( (count+0.5) / library.size )
>
> It then does some clever things with weights. What I'm considering
> instead is
>
> log2( count / library.size + moderation.amount / mean.library.size )
>
> where moderation.amount is much larger than 0.5, say 5. A couple of
> things here:
>
> - Instead of down-weighting low counts, I'm trying to get rid of the
> extra variation from low counts by artificially left censoring the data.
>
> - I'm using the mean of the library sizes because I want the left censor
> to be in the same place for each sample even if the library sizes are
> different, so that if a gene is entirely switched off in one condition
> it won't look variable just because there is a different left censor in
> each sample.
>
> I'm also using this transformation to create heatmaps.
>
> This seems to be working with the data set I am working with: I get more
> significant results and they seem reasonable by eye. It seems to me that
> even if this approach isn't ideal it should at least be safe; at worst
> it will cause limma to reduce the df.prior and produce less significant
> results. Anything I've missed?
>
> --
> Paul Harrison
>
> Victorian Bioinformatics Consortium / Monash University