[BioC] genefilter vs limma - many probes filtered

Tue May 27 20:47:41 CEST 2014

Dear Gordon

> Variance filtering should not be used at any stage of the limma analysis. You are right to be worried by it.  The Bioc posts you mention from 2009 and 2012 were about filtering by expression level, not by variance.
> 
> Variance filtering has only been shown to be valid and beneficial when using ordinary t-tests.  But greater benefits can be had by using the limma empirical Bayes t-test and filtering by expression.

Above all it depends on the particular data set which, if any, filter is beneficial, and which test (t, limma-t, …) leads to more experiment-wide power after FDR control. An analyst that wants to optimize their workflow needs to benchmark it on their data (as done by Marcin). The generic assertion above is misleading: greater benefits can be had from either method, has to depend on the data.

> If you think that very small or very large variances are an issue with your data, then you could discount them in a statistically valid way by using the robust option of the eBayes() function in limma.  Again this will give greater benefits than ad hoc filtering by observed variances.
> 
> Apart from the fact that variance filtering invalidates the limma algorithm (or any empirical Bayes algorithm),

It is not compatible with limma; the reason is explained in Fig. 2b/c of the PNAS paper (doi: 10.1073/pnas.0914005107), i.e. limma's inverse gamma model for the variances. It is easy to think of other empirical Bayes algorithms that are compatible with filtering (DESeq2 is only one example). As you know, all that is needed is independence of the filter criterion and the test statistic under the null.

> it also worries me that variance filtering lacks a good biological interpretation, whereas filtering by mean expression has the clear interpretation of removing genes that are not at worthwhile expression levels.

The explanation is simple - Affymetrix arrays have a strong gene-specific "probe effect”, which leads to a biology-independent baseline signal of each probe (and probe set). E.g. some probesets report a baseline of say 2.5 (log2 scale). even in absence of target mRNA, others one of 6. The biological signal is modulated on top of that baseline. This motivates using the variance, rather than mean, for Affymetrix arrays. For other data types, different reasonings apply.

	Kind regards
	Wolfgang

> 
> Best wishes
> Gordon
> 
> 
>> Date: Fri, 23 May 2014 13:22:23 +0200
>> From: Marcin Jakub Kami?ski <marcinjakubkaminski at gmail.com>
>> To: Ryan <rct at thompsonclan.org>
>> Cc: genefilter Maintainer <maintainer at bioconductor.org>,
>> 	bioconductor at r-project.org
>> Subject: Re: [BioC] genefilter vs limma - many probes filtered
>> 
>> Hello Ryan,
>> thanks for your clear elucidation on this.
>> Shame to admit, but after performing some additional reading I believe that
>> my question should (at least partially) have never been asked - in limma
>> guide it's advised to filter-out low intensities rather than low variances
>> and more details can be found in this discussion:
>> https://stat.ethz.ch/pipermail/bioconductor/2013-June/053071.html, which in
>> fact agrees with your response.
>> However, I'm still unable to find any straightforward answer to the
>> question about filtering by variance after the eBayes() procedure (
>> https://stat.ethz.ch/pipermail/bioconductor/2012-March/043895.html,
>> https://stat.ethz.ch/pipermail/bioconductor/2009-October/030062.html).
>> Also, I'm still worried about such 'beneficial' change after extensive
>> filtering, especially as I didn't found any cases, when >50% of genes have
>> been filtered.
>> 
>> Best regards,
>> Marcin
>> 
>> 
>> 
>> On Fri, May 23, 2014 at 5:33 AM, Ryan <rct at thompsonclan.org> wrote:
>> 
>>> Hi Marcin,
>>> 
>>> I believe that performing variance filtering is not compatible with the
>>> empirical Bayes methods employed in limma. The point of limma is to compute
>>> a moderated estimate of each gene's variance by using the average variance
>>> across all genes as a prior estimate. If you filter out genes based on
>>> their variance, then you will bias that prior estimate, and this bias will
>>> propagate to the posterior estimates. For example, if you filter out
>>> high-variance genes, limma will underestimate the prior variance, and
>>> overestimate the significance of your differential expression calls, which
>>> is not a desirable outcome.
>>> 
>>> It may possibly be defensible to perform variance filtering after the
>>> empirical Bayes step, but I'm not sure, and you would have to ask someone
>>> more knowledegable about such matters.
>>> 
>>> -Ryan
>>> 
>>> 
>>> On Thu May 22 18:41:24 2014, Marcin Kaminski [guest] wrote:
>>> 
>>>> Dear list,
>>>> I've followed the tips regarding gene filtering at
>>>> http://www.bioconductor.org/packages/release/bioc/
>>>> vignettes/genefilter/inst/doc/independent_filtering.pdf when analyzing
>>>> GEO data (GSE48060). In this case most probes would pass the tests (for
>>>> adj.p. < .05) if I filter out roughly 70% of them based on variance, which
>>>> will triple the number of positives compared to not filtering at all.
>>>> (related graphic: http://i.imgur.com/RuuvRIo.png)
>>>> Should I be concerned about such extensive filtering? Does it affect
>>>> further analysis with limma and introduce bias? If it's a problem, what are
>>>> the available solutions or diagnostics?
>>>> 
>>>> Thanks for your help!
>>>> 
>>>> Best regards,
>>>> Marcin
>>>> 
>>>> 
>>>>  -- output of sessionInfo():
>>>> 
>>>> R version 3.1.0 (2014-04-10)
>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>> 
>>>> locale:
>>>> [1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250
>>>> LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C
>>>> [5] LC_TIME=Polish_Poland.1250
>>>> 
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>> base
>>>> 
>>>> other attached packages:
>>>>  [1] RColorBrewer_1.0-5    hgu133plus2.db_2.14.0 org.Hs.eg.db_2.14.0
>>>> RSQLite_0.11.4        DBI_0.2-7             AnnotationDbi_1.26.0
>>>>  [7] GenomeInfoDb_1.0.2    genefilter_1.46.1     matrixStats_0.8.14
>>>> limma_3.20.3          GEOquery_2.30.0       Biobase_2.24.0
>>>> [13] BiocGenerics_0.10.0
>>>> 
>>>> loaded via a namespace (and not attached):
>>>>  [1] annotate_1.42.0   IRanges_1.22.6    R.methodsS3_1.6.1
>>>> RCurl_1.95-4.1    splines_3.1.0     stats4_3.1.0      survival_2.37-7
>>>> tools_3.1.0
>>>>  [9] XML_3.98-1.1      xtable_1.7-3
> 
> ______________________________________________________________________
> The information in this email is confidential and intend...{{dropped:4}}
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor