[BioC] limma

Gordon K Smyth smyth at wehi.EDU.AU
Sat Apr 9 03:35:54 CEST 2011


Hi Wolfgang,

> Date: Thu, 07 Apr 2011 12:07:05 +0200
> From: Wolfgang Huber <whuber at embl.de>
> To: bioconductor at r-project.org
> Subject: Re: [BioC] limma
>
> Hi Gordon
>
>> .... "limma ensures that all probes are
>> assigned at least a minimum non-zero expression level on all arrays, in
>> order to minimize the variability of log-intensities for lowly expressed
>> probes. Probes that are expressed in one condition but not the other will be
>> assigned a large fold change for which the denominator is the minimum
>> expression level. This approach has the advantage that genes can be
>> ranked by fold change in a meaningful way, because genes with larger
>> expression changes will always be assigned a larger fold change."

This comment was in the context of genes expressed in one condition and 
not the other (and was part of a longer post).  In this context the 
estimated fold change is essentially monotonic in the higher expression 
level, provided the zero value is offset away from zero, so larger 
expression changes do translate into larger fold changes.  In other 
contexts, it is a question of importance ranking, which I guess is the 
issue that you're raising below.
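
To make this concrete, here is a minimal R sketch (the intensity values
are made up) showing that, once the denominator is offset away from
zero, the estimated fold change is strictly increasing in the higher
expression level:

  ## probe expressed at level x in one condition, zero in the other
  k <- 16                      # offset
  x <- c(50, 500, 5000)        # increasing expression in the other condition
  (x + k) / (0 + k)            # fold changes: 4.125  32.25  313.5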

> I am not sure I follow:
>
> (i)  (20 + 16) / (10 + 16)   <   (15000 + 16) / (10000 + 16)
>
> but
>
> (ii)     20    /     10      >        15000   / 10000
>
> You assume that measurements of 20 and 10 are less reliable (or perhaps
> biologically less important?) than measurements of 15000 and 10000, thus
> that ranking (i) should be used

Generally I rank probes by a combination of statistical significance and 
fold change, not by fold change alone.  However, the discussion is in the 
context of Illumina expression data, and Illumina intensities of 10 and 20 
are almost certain to be from non-expressed probes, hence contain no 
biological signal.  So, yes, I would generally view measurements of 15000 
and 10000 as both statistically more precise and biologically more 
important than 20 and 10, and I would therefore want to rank as (i) rather 
than (ii).  I'm pretty sure that you would too.
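
Computing your example directly in R (these are just your own numbers)
makes the two rankings explicit:

  (20 + 16) / (10 + 16)        # 1.385: offset ratio, low intensities
  (15000 + 16) / (10000 + 16)  # 1.499: offset ratio, high intensities
  20 / 10                      # 2.0:   raw ratio, low intensities
  15000 / 10000                # 1.5:   raw ratio, high intensities

With the offset, the high-intensity probe ranks above the low-intensity
one, which is ranking (i).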

> - but that depends on an error model (which you encode in the 
> pseudocount parameter '16')

I put more faith in experimental evidence than I do in statistical error 
models.  The fact that offsetting the intensities away from zero reduces 
the FDR is an observation from considerable testing with calibration data 
sets.  The evidence doesn't rely on an error model.  Much of the evidence 
is laid out in the paper that I cited in my earlier email:

Shi, W, Oshlack, A, and Smyth, GK (2010). Optimizing the noise versus bias 
trade-off for Illumina Whole Genome Expression BeadChips. Nucleic Acids 
Research 38, e204.

> and a subjective trade-off between precision and effect size.

The fact that the value is chosen from experience with data, rather than 
as a parameter estimated from a mathematical model, doesn't make it 
subjective.  As I've said, I take mathematical models with a grain of 
salt.

It's easy to verify experimentally that well known preprocessing 
algorithms, like RMA for Affy data or vst for Illumina data (you're an 
author!), also have the effect of offsetting intensities away from zero 
before logging them.  I think it is a useful insight to observe that this 
offsetting is a good part of why those algorithms have good statistical 
properties.  vst has an effective offset of around 200 (Shi et al, Tables 
2 and 3).  As far as I know, the offset was not designed into either of 
the above algorithms.  I suspect it was rather a fortuitous but 
unexpected outcome.  The offset that vst seems to have isn't a natural 
outcome of the variance stabilization model, because it generally turns 
out to be much larger than the offset that would best stabilize the 
variance.  Anyway, we find that by using more modest offsets in the range 
16-50 for Illumina data, we can achieve FDR as good as vst but with less 
bias, that is, much less contraction of the fold changes.  Again, this is 
a conclusion from testing rather than from modelling.
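
A small sketch (made-up intensities for a true 4-fold change) shows the
contraction that larger offsets impose on the log-ratios:

  a <- 400; b <- 100           # true 4-fold change, log2 FC = 2
  for (k in c(0, 16, 50, 200))
      cat("offset", k, "-> log2 FC", round(log2((a + k) / (b + k)), 3), "\n")
  ## offset 0 gives 2, offset 16 gives 1.843, offset 50 gives 1.585,
  ## offset 200 (roughly vst's effective offset) gives 1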

I prefer to make the offset explicit, clearly visible to users, rather 
than leaving it implicit or unexpected.  This approach (neqc etc) isn't 
the only good way to address noise, bias and variance stabilization 
issues, but it's the one that seems to work best for me at the moment.
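
For anyone who wants to try the explicit-offset approach, a minimal
sketch using read.ilmn and neqc from limma (the file names here are
placeholders for your own BeadStudio/GenomeStudio exports):

  library(limma)
  ## read probe and negative-control summary profiles
  x <- read.ilmn(files = "probe_profile.txt",
                 ctrlfiles = "control_probe_profile.txt")
  ## normexp background correction using the negative controls, quantile
  ## normalization, then log2 after adding the offset (16 by default)
  y <- neqc(x, offset = 16)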

Cheers
Gordon

> I agree with you that the approach is useful, and also that it is good
> to provide a very simple recipe for people that either cannot deal with
> or do not care about the quantitative details. Still, this post is for
> the people that do :)
>
> Cheers
> Wolfgang
