[BioC] Data filtering

Wed Oct 10 16:15:49 CEST 2012

Hi,

Just a quick comment regarding the pre-filtering steps:

On Wed, Oct 10, 2012 at 6:06 AM, Mark Robinson
<mark.robinson at imls.uzh.ch> wrote:
[snip]
>> So in short - we now need to employ data filters to check and reduce noise in our data. Some ideas are:
>> - removing genes that have low expression (count) levels
>> - removing genes that have high variance across replicates
>> - removing genes that have low variance across time (constitutively expressed genes are biologically less interesting)
>
>
> I understand the first one (removing low counts) and would recommend it, but the statistics should somewhat take care of highlighting which genes are differential, between the contrast of interest that you specify.  So, are your second and third really necessary?

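I usually get a bit wary when I bring "the labels" of my data into
consideration when filtering, so the "within replicate" filter smells
a bit fishy to me: a filter that uses the group labels is no longer
independent of the test you run afterwards, and that can bias the
resulting p-values.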

A more rigorous analysis of what you could (and probably shouldn't) do
has been published by some familiar names in the Bioconductor
community:

Bourgon, Gentleman and Huber (2010), PNAS 107(21):9546-9551
Independent filtering increases detection power for high-throughput experiments
http://www.pnas.org/content/107/21/9546.long

Which is probably worth reading ...
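
To make the distinction concrete, here is a minimal Python/NumPy sketch
(mine, not code from the paper). The function names, the toy count
matrix and the min_mean threshold of 10 are all made up for
illustration. It contrasts a filter statistic computed over all samples
(no labels needed) with a within-group variance, which changes as soon
as the labels change:

```python
import numpy as np


def overall_filter(counts, min_mean=10.0):
    """Keep genes whose mean count over ALL samples is >= min_mean.

    Hypothetical helper: it never sees the sample labels.
    """
    counts = np.asarray(counts, dtype=float)
    return counts.mean(axis=1) >= min_mean


def within_group_variance(counts, labels):
    """Pooled within-group variance per gene.

    Hypothetical helper: the result depends on the sample labels.
    """
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    sum_sq = np.zeros(counts.shape[0])
    for g in groups:
        block = counts[:, labels == g]
        centered = block - block.mean(axis=1, keepdims=True)
        sum_sq += (centered ** 2).sum(axis=1)
    return sum_sq / (counts.shape[1] - len(groups))


# Toy data (invented): rows are genes, columns are samples.
counts = np.array([
    [0, 1, 0, 2, 1, 0],            # very low counts
    [50, 55, 48, 120, 130, 125],   # differs between the two groups
    [80, 82, 79, 81, 80, 83],      # flat across all samples
])
labels = np.array(["A", "A", "A", "B", "B", "B"])
shuffled = np.array(["A", "B", "A", "B", "A", "B"])

# Label-free filter: the mask cannot change when labels are permuted.
keep = overall_filter(counts, min_mean=10.0)

# Label-dependent statistic: permuting the labels changes it.
var_true = within_group_variance(counts, labels)
var_shuffled = within_group_variance(counts, shuffled)
```

The first filter only drops the low-count gene and would do so under
any labelling. The within-group variance of the second gene is small
with the true labels and large with the shuffled ones, so a cutoff on
it is tied to the labels in the way I was worried about above.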

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact