[BioC] A metric to determine best filtration in the limma package

Aaron Mackey mackey at hemoshear.com
Thu Sep 20 15:47:01 CEST 2012

Thanks again for your intuition.  I agree that if I wish to optimize my
sensitivity to detect genes that are "all off" in all but one experimental
group, I should simply filter low counts up to the sample size of that
smallest group.  But for a very large experimental design with hundreds of
contrasts (say, a drug panel profiling experiment), I'm willing to
sacrifice a few of those genes in return for better (more stable)
experiment-wide statistics and sensitivity across all of the contrasts.

To test my hypothesis, we ran a simple experiment, increasing the value of
X (the minimum number of samples with counts above a threshold) from 1 (the
most permissive filter, requiring counts in at least 1 sample), through 2,
3, 4, 5, 10, etc., up to more than 50% of the total sample size.  At each
value of X, we run the voom pipeline (edgeR calcNormFactors, voom
transformation, limma lmFit + eBayes) and record the number of isoforms
found DE at an FDR of 10% for each of ~50 independent contrasts.  Because
some contrasts (treatments) generate hundreds or thousands of DE isoforms
while others generate only a handful, we chose to monitor the rank (rather
than the total count) of each filter X within each contrast as a comparable
metric of "yield" (i.e. for every contrast, some particular value of X
yields the most DE isoforms and is thus "best"; for each value of X we
therefore have ~50 rankings, one per contrast).  We break ties using
ranks <- rank(..., ties.method="max"), and then normalize with
ranks <- ranks/max(ranks).  This yields the attached plot, in which we also
draw the median line to highlight how the distribution of ranks reaches a
maximum somewhere around X=50 (in this case, about 1/5 of the total sample
size, and 10 times the smallest experimental group size, which is 5).
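For concreteness, the sweep can be sketched in R roughly as follows.  This
is a minimal sketch, not our production code: `dge` (a DGEList), `design`,
`contrasts`, and the raw-count threshold `thresh` are assumed to exist
already, and the particular values of X are illustrative.

```r
library(edgeR)
library(limma)

Xs <- c(1:5, 10, 25, 50, 100)           # illustrative filter thresholds
de.counts <- sapply(Xs, function(X) {
  keep <- rowSums(dge$counts > thresh) >= X   # counts > thresh in >= X samples
  d    <- calcNormFactors(dge[keep, ])        # renormalize after filtering
  v    <- voom(d, design)                     # voom transformation
  fit  <- eBayes(contrasts.fit(lmFit(v, design), contrasts))
  # number of DE isoforms at 10% FDR, one entry per contrast
  colSums(decideTests(fit, method = "separate",
                      adjust.method = "BH", p.value = 0.10) != 0)
})
# within each contrast (row), rank the X values by yield, then scale to [0, 1]
norm.ranks <- apply(de.counts, 1, function(x) {
  r <- rank(x, ties.method = "max")
  r / max(r)
})
```

Each column of `norm.ranks` then holds one contrast's normalized rankings
across the candidate values of X, which is what the attached plot summarizes.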

In another similar dataset, we get a peak at X=100, so I agree with
Gordon's statement that such an approach will be data-dependent; in
retrospect, we should probably perform such "parameter tuning" using some
variety of cross-validation, so that our final results are not biased by
the tuning itself.
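One simple version of such a scheme, as a sketch (variable names are
illustrative; `dge` holds all samples and `group` labels the experimental
groups): tune X on a stratified random half of the replicates, then report
final DE results only from the held-out half.

```r
# Stratified split: half of each group's replicates for tuning, half held out
set.seed(1)
tune.idx <- unlist(lapply(split(seq_along(group), group), function(i)
  i[sample.int(length(i), ceiling(length(i) / 2))]))

dge.tune <- dge[, tune.idx]    # choose X using only these samples
dge.test <- dge[, -tune.idx]   # run the final analysis on these
```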

Other comments or suggestions?  Are we misinterpreting our results?

Thanks again,

On Tue, Sep 11, 2012 at 9:47 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:

> Dear Aaron,
> On Tue, 11 Sep 2012, Aaron Mackey wrote:
>> Thanks Gordon.  Would your advice about X still hold if the smallest group
>> size is 5 out of 500 total samples (i.e. an experimental design profiling
>> 100 treatments with 5 replicates each)?  5 out of 5000 samples?
> If you want to be able to detect genes that are expressed only in the
> group with the smallest size, then yes.
>> It seems to me that the available df should somehow impact your choice of
>> X, and that for very large df, you'd wish to raise X somewhat.
> I prefer to use a filtering rule that relates your scientific questions.
> How many libraries would a gene have to be expressed in to be
> scientifically interesting to you?  That should drive X, not a purely
> numerical rule that knows nothing about your experimental design.
>> What I'm trying to get at is that it seems, at least to my intuition, that
>> the desire to keep "at least X nonzero" stems from a possibility of a
>> transcript being entirely "off" (zero counts) in all but the single,
>> smallest group where it is found "on" (non-zero counts)
> Yes.
>> -- but the models we use don't really test/allow for such "binary"
>> switching behavior (i.e. infinite fold changes); so instead we're really
>> just trying to weed out those transcripts that tend to have zero counts
>> more often than others (which may not be due to expression at all, but
>> other artifacts of the experiment, including vagaries of read mapping),
>> that negatively impact the statistics of what we can model (non-infinite
>> fold changes).
> I understand the issue about wanting to remove transcripts with aberrant
> observations, but I think this is data dependent.
> You originally posted that you were worried that you were doing too much
> filtering.  I have advised you that there is no technical reason why you
> need to do so much filtering.  If you nevertheless want to keep the
> filtering just as stringent, then it's your judgement based on your
> understanding of your experiment!  I can't comment further.
> Note that edgeR (and voom to a lesser extent) uses a shrinkage strategy
> that protects against infinite fold changes.
>> Thus, we have a desire to come up with a metric or decision criterion for
>> which the "rule of thumb" is a good proxy in small/modest-sized
>> experiments
>> but which can also be extended in larger many-factorial experiments ...
>> i.e. something akin to getPriorN().  I imagine some tuning strategy (much
>> like lasso/ridge) where the decrease in error associated with more
>> aggressive filtering is offset by a loss in power, until a
>> balance/crossing
>> point is found; in an edgeR (not voom) scenario, this might be some aspect
>> of the BCV distribution/loess fit vs. some multivariate F-statistic of
>> the
>> experimental design?
> I don't really follow your thinking here.  If you have a huge experiment
> with too many DE genes to be conveniently interpreted, then I would suggest
> using the magnitude of the fold changes, and predictive (i.e., shrunk) fold
> changes from edgeR in particular, to prioritize.
> Best wishes
> Gordon
>> Thanks to all for brainstorming this,
>> -Aaron
>> On Mon, Sep 10, 2012 at 7:59 PM, Gordon K Smyth <smyth at wehi.edu.au>
>> wrote:
>>> Dear Mark,
>>> I think that voom() should be pretty tolerant of the amount of filtering
>>> that is done, so you can feel free to be more inclusive.
>>> Note that our recommended filtering is
>>>   keep <- rowSums(cpm(dge) > k) >= X
>>> where X is the sample size of the smallest group size.  Since X is
>>> usually
>>> smaller than half the number of arrays, our recommended filtering is
>>> usually more inclusive than the filter you give.
>>> You are also free to vary k, depending on your sequencing depth.  The
>>> idea
>>> is to filter low counts.
>>> Best wishes
>>> Gordon
>>> -------------- original message -------------
>>> [BioC] A metric to determine best filtration in the limma package
>>> Aaron Mackey amackey at virginia.edu
>>> Mon Sep 10 16:27:21 CEST 2012
>>> Hello Bioconductor Gurus!
>>> (I apologize if this goes through more than once)
>>> We are currently using limma (through the voom() function) to analyze
>>> RNA-seq data, represented as RSEM counts. We currently have 246 samples
>>> (including replicates) and our design matrix has 65 columns.
>>> My question is in regard to how much we should be filtering our data
>>> before running it through the analysis pipeline. Our current approach is
>>> to
>>> look for a CPM of greater than 2 in at least half of the samples. The
>>> code
>>> is:
>>> keep <- rowSums(cpm(dge) > 2) >= round(ncol(dge)/2)
>>> This brings down our transcript count from 73,761 to less than 20,000.
>>> While we do see groupings and batch effects we expect to see in the MDS
>>> plots, we are afraid we might be filtering too severely.
>>> So finally my question: What is a good metric for determining how well we
>>> have filtered the data?
>>> Thank you,
>>> Mark Lawson, PhD
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
-------------- next part --------------
A non-text attachment was scrubbed...
Name: filterRank.png
Type: image/png
Size: 227977 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20120920/19bee242/attachment.png>
