[BioC] A metric to determine best filtration in the limma package

Thu Sep 6 20:20:15 CEST 2012

Hi,

On Thu, Sep 6, 2012 at 1:52 PM, Mark Lawson <mlawsonvt09 at gmail.com> wrote:
> Hello Bioconductor Gurus!
>
> (I apologize if this goes through more than once)
>
> We are currently using limma (through the voom() function) to analyze
> RNA-seq data, represented as RSEM counts. We currently have 246 samples
> (including replicates) and our design matrix has 65 columns.
>
> My question is in regard to how much we should be filtering our data before
> running it through the analysis pipeline. Our current approach is to look
> for a CPM of greater than 2 in at least half of the samples. The code is:
>
> keep <- rowSums(cpm(dge) > 2) >= round(ncol(dge)/2)

I'm guessing you are using "normal" RNA-seq data (i.e., not some kind
of tag-sequencing protocol), so just a quick thought (apologies in
advance if I am misunderstanding your setup):

If you are filtering by counts per million without normalizing for the
approximate length of each transcript (as an RPKM/FPKM-like measure
would), aren't you biasing your filter (and, therefore, your data)
toward longer transcripts, which collect more reads at the same
expression level?
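
To make the concern concrete, here is a minimal sketch in base R. The
counts, the transcript lengths, and the RPKM cutoff of 1.5 are all made
up for illustration (they are not from your data), and the CPM/RPKM
values are computed by hand rather than with edgeR. It only shows that
the same "at least half of the samples" rule can keep a long transcript
and drop a short one that has higher per-kilobase expression:

```r
# Toy counts: 3 genes x 4 samples (hypothetical numbers).
# The "rest" row pads every library to exactly 1e6 reads so that CPM == count.
counts <- rbind(
  long  = c(10, 10, 10, 10),   # 10 kb transcript
  short = c(1, 1, 1, 1),       # 0.5 kb transcript
  rest  = rep(999989, 4)
)
colnames(counts) <- paste0("s", 1:4)

# Hypothetical transcript lengths in bp, in the same order as the rows.
gene_length <- c(long = 10000, short = 500, rest = 2000)

# Counts per million: divide each column by its library size.
lib_size <- colSums(counts)
cpm_mat  <- sweep(counts, 2, lib_size, "/") * 1e6

# RPKM-like value: CPM divided by transcript length in kb (row-wise).
rpkm_mat <- cpm_mat / (gene_length / 1000)

# Same rule as in the question: above the cutoff in at least half the samples.
min_samples <- round(ncol(counts) / 2)
keep_cpm  <- rowSums(cpm_mat  > 2)   >= min_samples
keep_rpkm <- rowSums(rpkm_mat > 1.5) >= min_samples  # 1.5 is an arbitrary cutoff

print(cbind(keep_cpm, keep_rpkm))
```

Here "long" has CPM 10 but an RPKM of about 1, while "short" has CPM 1
but an RPKM of about 2, so the two filters disagree on both genes. If
you have transcript lengths at hand, edgeR's rpkm() accepts them through
its gene.length argument, so you would not need to compute this by hand.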

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact