[BioC] A metric to determine best filtration in the limma package

Tue Sep 11 01:59:30 CEST 2012

Dear Mark,

I think that voom() should be pretty tolerant of the amount of filtering 
that is done, so you can feel free to be more inclusive.

Note that our recommended filtering is

   keep <- rowSums(cpm(dge) > k) >= X

where X is the sample size of the smallest group.  Since X is usually 
smaller than half the number of samples, our recommended filtering is 
usually more inclusive than the filter you give.

You are also free to vary k, depending on your sequencing depth.  The idea 
is to filter out low counts.
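
One possible way to tie k to sequencing depth, offered as an assumption 
rather than a fixed rule, is to choose a minimum raw count and convert it to 
a CPM cutoff using the smallest library size.  The library sizes and the 
minimum count below are hypothetical:

```r
# Hypothetical library sizes (total reads per sample).
lib_sizes <- c(5e6, 10e6, 20e6)

# Hypothetical minimum raw count considered informative.
min_count <- 10

# CPM cutoff that corresponds to min_count reads in the smallest library.
k <- min_count / (min(lib_sizes) / 1e6)
```

With these numbers k works out to 2; deeper libraries would give a smaller 
cutoff for the same minimum count.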

Best wishes
Gordon

-------------- original message -------------
[BioC] A metric to determine best filtration in the limma package
Aaron Mackey amackey at virginia.edu
Mon Sep 10 16:27:21 CEST 2012

Hello Bioconductor Gurus!

(I apologize if this goes through more than once)

We are currently using limma (through the voom() function) to analyze 
RNA-seq data, represented as RSEM counts. We currently have 246 samples 
(including replicates) and our design matrix has 65 columns.

My question is in regard to how much we should be filtering our data 
before running it through the analysis pipeline. Our current approach is 
to look for a CPM of greater than 2 in at least half of the samples. The 
code is:

keep <- rowSums(cpm(dge) > 2) >= round(ncol(dge)/2)

This brings down our transcript count from 73,761 to fewer than 20,000.
While we do see the groupings and batch effects we expect in the MDS
plots, we are afraid we might be filtering too severely.

So finally my question: What is a good metric for determining how well we
have filtered the data?

Thank you,
Mark Lawson, PhD
