[BioC] Data filtering

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed Oct 10 16:15:49 CEST 2012


Just a quick comment regarding the pre-filtering steps:

On Wed, Oct 10, 2012 at 6:06 AM, Mark Robinson
<mark.robinson at imls.uzh.ch> wrote:
>> So in short - we now need to employ data filters to check and reduce noise in our data. Some ideas are:
>> - removing genes that have low expression (count) levels
>> - removing genes that have high variance across replicates
>> - removing genes that have low variance across time (constitutively expressed genes are biologically less interesting)
> I understand the first one (removing low counts) and would recommend it, but the statistics should somewhat take care of highlighting which genes are differential, between the contrast of interest that you specify.  So, are your second and third really necessary?

I usually get a bit wary when I bring "the labels" of my data into
consideration when filtering (so, the "within replicate stuff" smells
a bit fishy to me).

A more rigorous analysis of what you could (and probably shouldn't) do
has been published by some familiar names in the Bioconductor
community:
Independent filtering increases detection power for high-throughput experiments

Which is probably worth reading ...
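
To make the "don't look at the labels" point concrete, here is a
minimal sketch of a label-blind low-count filter. It is only an
illustration, not the procedure from the paper: the function name
`filter_low_counts`, the toy counts, and the cutoff of 5 are all
assumptions of mine. The point is that the filter statistic (mean count
over all samples) is computed without ever consulting which sample
belongs to which condition, time point, or replicate group.

```python
from statistics import mean


def filter_low_counts(counts, min_mean=5.0):
    """Keep genes whose mean count over ALL samples reaches min_mean.

    counts: dict mapping a gene id to its list of per-sample counts.
    min_mean: hypothetical cutoff, chosen arbitrarily for this sketch.

    Sample labels are never used, so the filter is blind to the
    experimental design.
    """
    return {gene: row for gene, row in counts.items() if mean(row) >= min_mean}


# Toy data: six samples per gene, labels deliberately not represented.
counts = {
    "geneA": [0, 1, 0, 2, 1, 0],              # barely expressed
    "geneB": [50, 48, 55, 120, 130, 110],     # expressed, changes a lot
    "geneC": [10, 12, 9, 11, 10, 13],         # expressed, fairly flat
}

kept = filter_low_counts(counts)
print(sorted(kept))  # ['geneB', 'geneC']
```

By contrast, a filter on "variance across replicates" has to know which
samples are replicates of each other, which is exactly the kind of
label-aware filtering I'd be wary of.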


Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
