[BioC] Filtering is not recommended with LIMMA?

Gordon K Smyth smyth at wehi.EDU.AU
Wed Jun 5 09:44:35 CEST 2013

Dear Wolfgang,

With all respect, I meant exactly what I said.

You have taken the discussion out of context, and some of your claims are 
wrong in my opinion.

On Sun, 26 May 2013, Wolfgang Huber wrote:

> Dear Gordon
>> The literature tends to say that the reason for filtering is to reduce 
>> the amount of multiple testing, but in truth the increase in power from 
>> this is only slight.  The more important reason for filtering in most 
>> applications is to remove highly variable genes at low intensities. 
>> The importance of filtering is highly dependent on how you 
>> pre-processed your data.  Filtering is less important if you (i) use a 
>> good background correction or normalising method that damps down 
>> variability at low intensities and (ii) use eBayes(trend=TRUE) which 
>> accommodates a mean-variance trend.

You have taken out of context one paragraph from my reply to Miriam:


I was answering a specific question about the limma package, but you have 
lost that context.  You don't even include the date of the post you are 
replying to.

> With all respect, I think this paragraph mixes up two separate issues 
> and can benefit from clarification.
> 1. While literature can probably be found to support any statement, the 
> above-cited reason is indeed bogus when multiple testing is performed 
> with an FDR objective.

Not bogus. Just less important than some other considerations.

> The paper by Bourgon et al. motivates filtering differently, namely by 
> using a filter criterion that is independent of the test statistic under 
> the null (thus does not affect type-I error; some subtlety is discussed 
> in that paper) but dependent under the alternative (thus improves 
> power).

This is a good time to recall that the question was about filtering with 
the limma package, not about filtering in conjunction with t-tests or 
permutation tests.  Your paper (Bourgon et al) provides no motivation for 
filtering in conjunction with limma.  Quite the opposite, your paper 
concludes (incorrectly IMO) on its final page that limma needs to be used 

In reality, filtering low intensity probes (not low variance probes) is 
usually of benefit to limma, and we do this routinely for nearly all 
analyses in my lab.  This is for a number of reasons.

First there is the generic (not specific to limma) reason that probes that 
are not detecting real signal to any worthwhile degree for any sample 
cannot be detecting DE to any worthwhile degree.  Therefore there is a 
positive correlation between mean log intensity and true DE.

Second there is the limma-specific reason that probes that are not 
detecting signal above background levels in any sample tend to have 
atypical variances, both in absolute size and in terms of mean-variance 
relationship, compared to probes that are responding to genuine biological 
signal.  In other words, non-expressed or dead probes have variances that 
cannot be considered to be sampled from the same population as variances 
for regularly expressed probes.  It is desirable to get rid of 
these atypical probes so that limma can concentrate on the behaviour of 
probes of genuine interest.

Filtering by mean log-intensity does not cause any problems for the limma 
probabilistic model.  Indeed it generally improves concordance with the 
empirical Bayes assumptions.
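
As a minimal sketch of what filtering by mean log-intensity before a limma 
fit might look like: the simulated matrix, the two-group design and the 
cutoff of 6 below are illustrative assumptions, not values from this 
thread.  A real cutoff would be chosen by looking at the data.

```r
library(limma)

set.seed(1)

# Simulated log2-expression matrix: 1000 probes x 6 samples (illustrative only)
n_probes <- 1000
mu <- runif(n_probes, min = 4, max = 14)          # per-probe mean log2-intensity
y <- matrix(rnorm(n_probes * 6, mean = rep(mu, 6), sd = 0.3), nrow = n_probes)
rownames(y) <- paste0("probe", seq_len(n_probes))

# Two groups of three samples each
group <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~ group)

# Hypothetical cutoff: keep probes whose mean log2-intensity exceeds 6
cutoff <- 6
keep <- rowMeans(y) > cutoff
y_filt <- y[keep, ]

# Fit the linear model on the retained probes, with a trended prior variance
fit <- lmFit(y_filt, design)
fit <- eBayes(fit, trend = TRUE)
top <- topTable(fit, coef = 2, number = 10)
```

The filter uses only the row means, not the group labels, so it is applied 
before the model is fitted and without reference to the test statistic.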

> 2. "Highly variable genes at low intensities" are indeed a problem of 
> bad preprocessing and are better dealt with at that level, not by 
> filtering.

I agree in most cases, but it's not universally true.  Pre-processing 
methods that damp down variability at low intensities also tend to attenuate 
fold changes.  In some applications it can be legitimate to allow higher 
variability at low intensities in order to maintain dynamic range in the 
fold changes.  voom is one such application where the preprocessed and 
normalized expression values are deliberately kept more variable at the 
low end than the high end.

> Nowadays, the commonly used methods for expression microarray or RNA-Seq 
> analysis that I am aware of avoid that problem.

Yes, the high variability is gone but the non-expressed probes are still 
atypical.  With most commonly used methods, the non-expressed probes now 
have atypically small variances.  For example, the RMA algorithm (used in 
your paper) yields a mean-variance relationship that increases at low 
intensities then decreases again at high intensities.  The lowest 
intensity probes have variances almost zero.  This effect is even stronger 
using the vst algorithm for Illumina BeadArrays (you are an author of the 
vst paper).  This method typically generates a very pronounced 
(increasing) mean-variance trend for probes at very low levels.

Anyone can see this by using the plotSA() function in limma to plot the
mean-variance relationship.
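
A minimal sketch of such a plot, assuming a log2-expression matrix y and a 
two-group design.  The data here are simulated, with low-intensity probes 
deliberately given very small variances to mimic the effect described 
above; the threshold of 6 and the standard deviations are assumptions made 
for illustration only.

```r
library(limma)

set.seed(2)

# Simulated data: probes with mean below 6 get an atypically small variance
n_probes <- 1000
mu <- runif(n_probes, min = 4, max = 14)
sd_probe <- ifelse(mu < 6, 0.05, 0.3)
y <- matrix(rnorm(n_probes * 6, mean = rep(mu, 6), sd = rep(sd_probe, 6)),
            nrow = n_probes)

group <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~ group)

fit <- eBayes(lmFit(y, design), trend = TRUE)

# Plot residual standard deviation (square-root scale) against average
# log-expression, with the fitted prior trend overlaid
plotSA(fit)
```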

Atypical low variances diminish the potential benefits from the empirical 
Bayes algorithm just as do atypical large variances, so the benefit that 
derives from filtering non-expressed probes remains.

The reason I worded my post in terms of high variances was simply because 
the strongest and most frequent arguments for filtering were made over 10 
years ago when large variances were common.

> 3. The question when & how independent filtering (as in 1) is beneficial 
> is quite unrelated to preprocessing.

I strongly disagree.  The benefit that may or may not come from filtering 
is intimately connected to the behaviour of the data, especially to the 
mean-variance trend, and this depends intimately on the platform and on 
the preprocessing.


> You are right that FDR is a property of the whole selected gene list, 
> not of individual genes, and that different approaches exist for 
> spending the type-I error budget wisely, by weighting different genes 
> differently; of which independent filtering is one and trended eBayes 
> (which is not the default option in limma) may be another.
> 	Best wishes
> 	Wolfgang
> Reference:
> Bourgon et al. PNAS 2010: http://www.pnas.org/content/107/21/9546
