[BioC] A metric to determine best filtration in the limma package

Gordon K Smyth smyth at wehi.EDU.AU
Wed Sep 12 03:47:41 CEST 2012

Dear Aaron,

On Tue, 11 Sep 2012, Aaron Mackey wrote:

> Thanks Gordon.  Would your advice about X still hold if the smallest group
> size is 5 out of 500 total samples (i.e. an experimental design profiling
> 100 treatments with 5 replicates each)?  5 out of 5000 samples?

If you want to be able to detect genes that are expressed only in the 
group with the smallest size, then yes.

> It seems to me that the available df should somehow impact your choice 
> of X, and that for very large df, you'd wish to raise X somewhat.

I prefer to use a filtering rule that relates your scientific questions. 
How many libraries would a gene have to be expressed in to be 
scientifically interesting to you?  That should drive X, not a purely 
numerical rule that knows nothing about your experimental design.

> What I'm trying to get at is that it seems, at least to my intuition, that
> the desire to keep "at least X nonzero" stems from a possibility of a
> transcript being entirely "off" (zero counts) in all but the single,
> smallest group where it is found "on" (non-zero counts)


> -- but the models we use don't really test/allow for such "binary" 
> switching behavior (i.e. infinite fold changes); so instead we're really 
> just trying to weed out those transcripts that tend to have zero counts 
> more often than others (which may not be due to expression at all, but 
> other artifacts of the experiment, including vagaries of read mapping), 
> that negatively impact the statistics of what we can model (non-infinite 
> fold changes).

I understand the issue about wanting to remove transcripts with aberrant 
observations, but I think this is data dependent.

You originally posted that you were worried that you were doing too much 
filtering.  I have advised you that there is no technical reason why you 
need to do so much filtering.  If you nevertheless want to keep the 
filtering just as stringent, then it's your judgement based on your 
understanding of your experiment!  I can't comment further.

Note that edgeR (and voom to a lesser extent) uses a shrinkage strategy 
that protects against infinite fold changes.

> Thus, we have a desire to come up with a metric or decision criterion for
> which the "rule of thumb" is a good proxy in small/modest-sized experiments
> but which can also be extended in larger many-factorial experiments ...
> i.e. something akin to getPriorN().  I imagine some tuning strategy (much
> like lasso/ridge) where the decrease in error associated with more
> aggressive filtering is offset by a loss in power, until a balance/crossing
> point is found; in an edgeR (not voom) scenario, this might be some aspect
> of the BCV distribution/loess fitt vs. some multivariate F-statistic of the
> experimental design?

I don't really follow your thinking here.  If you have a huge experiment 
with too many DE genes to be conveniently interpreted, then I would 
suggest using the magnitude of the fold changes, and predictive (i.e., 
shrunk) fold changes from edgeR in particular, to prioritize.

Best wishes

> Thanks to all for brainstorming this,
> -Aaron
> On Mon, Sep 10, 2012 at 7:59 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>> Dear Mark,
>> I think that voom() should be pretty tolerant of the amount of filtering
>> that is done, so you can feel free to be more inclusive.
>> Note that our recommended filtering is
>>   keep <- rowSums(cpm(dge) > k) >= X
>> where X is the sample size of the smallest group size.  Since X is usually
>> smaller than half the number of arrays, our recommended filtering is
>> usually more inclusive than the filter you give.
>> You are also free to vary k, depending on your sequencing depth.  The idea
>> is to filter low counts.
>> Best wishes
>> Gordon
>> -------------- original message -------------
>> [BioC] A metric to determine best filtration in the limma package
>> Aaron Mackey amackey at virginia.edu
>> Mon Sep 10 16:27:21 CEST 2012
>> Hello Bioconductor Gurus!
>> (I apologize if this goes through more than once)
>> We are currently using limma (through the voom() function) to analyze
>> RNA-seq data, represented as RSEM counts. We currently have 246 samples
>> (including replicates) and our design matrix has 65 columns.
>> My question is in regard to how much we should be filtering our data
>> before running it through the analysis pipeline. Our current approach is to
>> look for a CPM of greater than 2 in at least half of the samples. The code
>> is:
>> keep <- rowSums(cpm(dge) > 2) >= round(ncol(dge)/2)
>> This brings down our transcript count from 73,761 to less than 20,000.
>> While we do see groupings and batch effects we expect to see in the MDS
>> plots, we are afraid we might be filtering too severely.
>> So finally my question: What is a good metric for determining how well we
>> have filtered the data?
>> Thank you,
>> Mark Lawson, PhD

The information in this email is confidential and intend...{{dropped:4}}

More information about the Bioconductor mailing list