[BioC] genefilter vs limma - many probes filtered

Wolfgang Huber whuber at embl.de
Tue May 27 20:47:31 CEST 2014


Dear Marcin

the platform used for GSE48060 is the Affymetrix Human Genome U133 Plus 2.0 Array. The CDF package ‘hgu133plus2cdf’ defines 54675 probe sets. I find it not implausible that a large fraction of these do not map to proper genes, or map to genes that are not expressed in blood. In that case, filtering them out is beneficial, and this is what the plot that you link below indicates.
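
To make the multiple-testing side of this concrete, here is a minimal sketch in plain Python (not limma or genefilter). All numbers are made up and have nothing to do with GSE48060: 1000 hypothetical null probe sets, 50 hypothetical probe sets with a modest real effect, and a filter that is assumed to remove only nulls, which a real variance filter will not achieve exactly.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def n_significant(pvalues, alpha=0.05):
    """Number of hypotheses with BH-adjusted p-value below alpha."""
    return sum(adj < alpha for adj in bh_adjust(pvalues))


# Hypothetical toy data (assumption, not taken from any real experiment):
# null probe sets with evenly spread p-values, plus a modest real signal.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
signal_p = [0.005] * 50

# Assumed filter: unrelated to the null p-values, drops 70% of the nulls
# (keeps 3 out of every 10) and keeps every signal probe set.
kept_null_p = [p for i, p in enumerate(null_p) if i % 10 < 3]

unfiltered = n_significant(null_p + signal_p)
filtered = n_significant(kept_null_p + signal_p)
print("significant without filtering:", unfiltered)
print("significant after filtering:  ", filtered)
```

With these made-up numbers nothing survives the adjustment without filtering, while after filtering the 50 signal probe sets (and 3 nulls) do. The gain comes entirely from testing fewer uninformative hypotheses, so it is only as trustworthy as the assumption that the filter does not depend on the test statistic under the null.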

Re methodology, see also my other, following post.
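
Ryan's argument quoted below (filtering on variance before the empirical Bayes step shifts the prior variance) can be sketched with a toy simulation in plain Python. This is a simplification, not limma's actual procedure: the prior variance is taken as the plain mean of the gene-wise sample variances, following Ryan's description, and the prior degrees of freedom `D0` are fixed at an assumed value, whereas limma estimates both from the data. Only the shrinkage formula (d0 * s0^2 + d * s^2) / (d0 + d) is the standard moderated variance.

```python
import random

random.seed(1)

N_GENES = 2000  # hypothetical number of genes
D = 4           # residual degrees of freedom per gene (assumed)
D0 = 4.0        # prior degrees of freedom, fixed for simplicity (assumed)


def sample_variance(true_var, d=D):
    """Draw a sample variance with d degrees of freedom around true_var."""
    chi2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(d))
    return true_var * chi2 / d


def moderated_variance(s2, prior, d=D, d0=D0):
    """Shrink a gene-wise variance towards the prior variance."""
    return (d0 * prior + d * s2) / (d0 + d)


def mean(values):
    return sum(values) / len(values)


# Simulated gene-wise variances (made-up distribution), sorted ascending.
true_vars = [random.lognormvariate(0.0, 0.5) for _ in range(N_GENES)]
s2_sorted = sorted(sample_variance(v) for v in true_vars)

half = N_GENES // 2
prior_all = mean(s2_sorted)               # no filtering
prior_drop_high = mean(s2_sorted[:half])  # high-variance half removed
prior_drop_low = mean(s2_sorted[half:])   # low-variance half removed

# Effect on one gene that survives the "drop high variance" filter:
# the moderated t-statistic scales with 1 / sqrt(moderated variance).
s2_gene = s2_sorted[half - 1]
inflation = (moderated_variance(s2_gene, prior_all)
             / moderated_variance(s2_gene, prior_drop_high)) ** 0.5

print("prior, all genes:        ", round(prior_all, 3))
print("prior, high var removed: ", round(prior_drop_high, 3))
print("prior, low var removed:  ", round(prior_drop_low, 3))
print("t inflation factor:      ", round(inflation, 3))
```

Removing the high-variance half lowers the simplified prior and inflates the moderated t-statistic of the remaining genes, which is Ryan's example. Removing the low-variance half, as a variance filter such as the one discussed in this thread does, shifts the prior in the opposite direction.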

Kind regards
		Wolfgang

On 23 May 2014, at 13:22, Marcin Jakub Kamiński <marcinjakubkaminski at gmail.com> wrote:

> Hello Ryan,
> thanks for your clear elucidation on this.
> I'm ashamed to admit it, but after some additional reading I believe that
> my question should (at least partially) never have been asked: the limma
> guide advises filtering out low intensities rather than low variances,
> and more details can be found in this discussion:
> https://stat.ethz.ch/pipermail/bioconductor/2013-June/053071.html, which in
> fact agrees with your response.
> However, I'm still unable to find any straightforward answer to the
> question about filtering by variance after the eBayes() procedure (
> https://stat.ethz.ch/pipermail/bioconductor/2012-March/043895.html,
> https://stat.ethz.ch/pipermail/bioconductor/2009-October/030062.html).
> Also, I'm still worried about such a 'beneficial' change after extensive
> filtering, especially as I haven't found any cases in which >50% of genes
> were filtered.
> 
> Best regards,
> Marcin
> 
> 
> 
> On Fri, May 23, 2014 at 5:33 AM, Ryan <rct at thompsonclan.org> wrote:
> 
>> Hi Marcin,
>> 
>> I believe that performing variance filtering is not compatible with the
>> empirical Bayes methods employed in limma. The point of limma is to compute
>> a moderated estimate of each gene's variance by using the average variance
>> across all genes as a prior estimate. If you filter out genes based on
>> their variance, then you will bias that prior estimate, and this bias will
>> propagate to the posterior estimates. For example, if you filter out
>> high-variance genes, limma will underestimate the prior variance, and
>> overestimate the significance of your differential expression calls, which
>> is not a desirable outcome.
>> 
>> It may be defensible to perform variance filtering after the
>> empirical Bayes step, but I'm not sure, and you would have to ask someone
>> more knowledgeable about such matters.
>> 
>> -Ryan
>> 
>> 
>> On Thu May 22 18:41:24 2014, Marcin Kaminski [guest] wrote:
>> 
>>> Dear list,
>>> I've followed the tips regarding gene filtering at
>>> http://www.bioconductor.org/packages/release/bioc/vignettes/genefilter/inst/doc/independent_filtering.pdf
>>> when analyzing GEO data (GSE48060). In this case the largest number of probes
>>> pass the test (adj. p < 0.05) if I filter out roughly 70% of them based on
>>> variance, which triples the number of positives compared to not filtering at all.
>>> (related graphic: http://i.imgur.com/RuuvRIo.png)
>>> Should I be concerned about such extensive filtering? Does it affect
>>> further analysis with limma and introduce bias? If it's a problem, what are
>>> the available solutions or diagnostics?
>>> 
>>> Thanks for your help!
>>> 
>>> Best regards,
>>> Marcin
>>> 
>>> 
>>>  -- output of sessionInfo():
>>> 
>>> R version 3.1.0 (2014-04-10)
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>> 
>>> locale:
>>> [1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250
>>> LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C
>>> [5] LC_TIME=Polish_Poland.1250
>>> 
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> base
>>> 
>>> other attached packages:
>>>  [1] RColorBrewer_1.0-5    hgu133plus2.db_2.14.0 org.Hs.eg.db_2.14.0
>>> RSQLite_0.11.4        DBI_0.2-7             AnnotationDbi_1.26.0
>>>  [7] GenomeInfoDb_1.0.2    genefilter_1.46.1     matrixStats_0.8.14
>>> limma_3.20.3          GEOquery_2.30.0       Biobase_2.24.0
>>> [13] BiocGenerics_0.10.0
>>> 
>>> loaded via a namespace (and not attached):
>>>  [1] annotate_1.42.0   IRanges_1.22.6    R.methodsS3_1.6.1
>>> RCurl_1.95-4.1    splines_3.1.0     stats4_3.1.0      survival_2.37-7
>>> tools_3.1.0
>>>  [9] XML_3.98-1.1      xtable_1.7-3
>>> 
>>> 
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> 
>> 
> 
