[BioC] limma
    Gordon K Smyth 
    smyth at wehi.EDU.AU
       
    Mon Apr 11 09:14:52 CEST 2011
    
    
  
Wolfgang,
Another connection is that I apply the offset after normexp background 
correction (using negative control probes), which is itself related to 
generalized log transformations.  This means my x is always positive and 
strictly floored at zero before logging.
Gordon
On Sun, 10 Apr 2011, Wolfgang Huber wrote:
> Hi Gordon
>
> I think we agree, my only reason for making this post is to point out that 
> while the choice of '16' might be a good comprise overall, better choices can 
> exist depending on the dataset. E.g. if there are many replicates, then these 
> can compensate the imprecision from using a smaller offset and improved 
> biases may be had.
>
> Variance stabilising transformations are of course very related to the 
> offseting that you propose. The "glog" in vsn is
>
>  glog(x) = log( x+ sqrt(x^2 + c) )
>
> which is in practice similar to the shifted log, i.e. log (x+a). (It even 
> avoids problems that the shifted log has when x approaches -a. See also e.g. 
> PMID 12761059.)
>
> Such transformations provide a good way of looking at microarray data, and 
> indeed at high-thoughput sequencing data, for many though not all uses.
>
> As you say in your NAR article, there is a continuum of choices along the 
> precision - bias tradeoff axis, which can be parameterized by 'a' or 'c'.
>
> Btw, a subtlety of vst that many users seem not to be aware of is that it 
> uses the between beads variance (for the same gene, in the same sample) 
> rather than the more relevant (and larger) between replicates variance for 
> finding its 'c'.
>
> 	Best wishes
> 	Wolfgang
>
>
> PS A pet peeve of mine, how come that people are OK to state numbers such as 
> '16' without reference to units: ideally, 'official' SI units, but even an ad 
> hoc definition would do, based on a common agreement on reagent quantities, 
> scanner settings, software settings, to make the oxymoron 'arbitrary 
> fluorescence units' just a little bit less so.
>
>
>
> Il Apr/9/11 3:35 AM, Gordon K Smyth ha scritto:
>> Hi Wolfgang,
>> 
>>> Date: Thu, 07 Apr 2011 12:07:05 +0200
>>> From: Wolfgang Huber <whuber at embl.de>
>>> To: bioconductor at r-project.org
>>> Subject: Re: [BioC] limma
>>> 
>>> Hi Gordon
>>> 
>>>> .... "limma ensures that all probes are
>>>> assigned at least a minimum non-zero expression level on all arrays, in
>>>> order to minimize the variability of log-intensities for lowly expressed
>>>> probes. Probes that are expressed in one condition but not other will be
>>>> assigned a large fold change for which the denominator is the minimum
>>>> expression level. This approach has the advantage that genes can be
>>>> ranked by fold change in a meaningful way, because genes with larger
>>>> expression expression changes will always be assigned a larger fold
>>>> change."
>> 
>> This comment was in the context of genes expressed in one condition and
>> not the other (and was part of a longer post). In this context the
>> estimated fold change is essentially monotonic in the higher expression
>> level, provided the zero value is offset away from zero, so larger
>> expression changes do translate into larger fold changes. In other
>> contexts, it is a question of importance ranking, which I guess is the
>> issue that you're raising below.
>> 
>>> I am not sure I follow:
>>> 
>>> (i) (20 + 16) / (10 + 16) < (15000 + 16) / (10000 + 16)
>>> 
>>> but
>>> 
>>> (ii) 20 / 10 > 15000 / 10000
>>> 
>>> You assume that measurements of 20 and 10 are less reliable (or perhaps
>>> biologically less important?) than measurements of 20000 and 10000, thus
>>> that ranking (i) should be used
>> 
>> Generally I rank probes by a combination of statistical significance and
>> fold change, not by fold change alone. However, the discussion is in the
>> context of Illumina expression data, and Illumina intensities of 10 and
>> 20 are almost certain to be from non-expressed probes, hence contain no
>> biological signal. So, yes, I would generally view measurements of 20000
>> and 10000 as both statistically more precise and biologically more
>> important than 20 and 10, and I would therefore want to rank as (i)
>> rather than (ii). I'm pretty sure that you would too.
>> 
>>> - but that depends on an error model (which you encode in the
>>> pseudocount parameter '16')
>> 
>> I put more faith in experimental evidence than I do in statistical error
>> models. The fact that offsetting the intensities away from zero reduces
>> the FDR is an observation from considerable testing with calibration
>> data sets. The evidence doesn't rely on an error model. Much of the
>> evidence is laid out in the paper that I cited in my earlier email:
>> 
>> Shi, W, Oshlack, A, and Smyth, GK (2010). Optimizing the noise versus
>> bias trade-off for Illumina Whole Genome Expression BeadChips. Nucleic
>> Acids Research 38, e204.
>> 
>>> and a subjective trade-off between precision and effect size.
>> 
>> The fact that the value is chosen from experience with data, rather than
>> as as a parameter estimated from a mathematical model, doesn't make it
>> subjective. As I've said, I take mathematical models with a grain of salt.
>> 
>> It's easy to verify experimentally that well known preprocessing
>> algorithms, like RMA for Affy data or vst for Illumina data (you're an
>> author!), also have the effect of offsetting intensities away from zero
>> before logging them. I think it is a useful insight to observe that this
>> offsetting is a good part of why those algorithms have good statistical
>> properties. vst has an effective offset of around 200 (Wei et al, Tables
>> 2 and 3). As far as I know, the offset was not designed into either of
>> the above algorithms. I suspect it was rather a fortuitious but
>> unexpected outcome. The offset that vst seems to have isn't a natural
>> outcome of the variance stabilization model, because it generally turns
>> out to be much larger than the offset that would best stabilize the
>> variance. Anyway, we find that by using more modest offsets in the range
>> 16-50 for Illumina data, we can achieve FDR as good as vst but with less
>> bias, much less contraction of the fold changes. Again, this is a
>> conclusion from testing rather than from modelling.
>> 
>> I prefer to make the offset explicit, clearly visible to users, rather
>> than leaving it implicit or unexpected. This approach (neqc etc) isn't
>> the only good way to address noise, bias and variance stabilization
>> issues, but it's the one that seems to work best for me at the moment.
>> 
>> Cheers
>> Gordon
>> 
>>> I agree with you that the approach is useful, and also that it is good
>>> to provide a very simple recipe for people that either cannot deal with
>>> or do not care about the quantitative details. Still, this post is for
>>> the people that do :)
>>> 
>>> Cheers
>>> Wolfgang
>> 
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for
>> the addressee.
>> You must not disclose, forward, print or use it without the permission
>> of the sender.
>> ______________________________________________________________________
>
> -- 
>
>
> Wolfgang Huber
> EMBL
> http://www.embl.de/research/units/genome_biology/huber
>
>
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
    
    
More information about the Bioconductor
mailing list