[BioC] RNA-seq, low count filtering and multiple testing

Tue Dec 11 08:52:20 CET 2012

Hi all,

I understand it is normal to filter out lowly expressed genes before
performing differential expression analysis on RNA-seq data (e.g.,
with edgeR or DESeq).

However, with methods such as edgeR, I find a number of genes where a
single outlying count is clearly causing the gene to be deemed
significantly DE (though the dispersion value is quite high).

For example:

       control1  control2  control3  case1  case2  case3
geneA         0         1         3      1      2     30

Note that case3 is not an outlier sample: MDS plots show it to be like
the other case samples, and the phenotype of the samples is as we
would expect. In other words, it is this gene's count in case3 that is
the outlier, not the sample as a whole.

Would it be fair to filter such examples out? I am thinking of a
filtering rule such that:

for each gene, if its count is below some threshold X in at least one
case sample AND in at least one control sample, discard it.

This way I don't get rid of genes where the expression is high in case
and very low (or absent) in control, or vice versa.
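
To make the rule concrete, here is a minimal sketch in plain Python (not
edgeR or DESeq code). The function name keep_gene, the threshold X = 5,
and the rows geneB and geneC are made up for illustration; only geneA
comes from the example above.

```python
def keep_gene(control, case, x):
    """Return True if the gene survives the proposed filter.

    A gene is discarded when at least one control sample AND at least
    one case sample have a count below the threshold x.
    """
    low_control = any(c < x for c in control)
    low_case = any(c < x for c in case)
    return not (low_control and low_case)


# (control counts, case counts) per gene
counts = {
    "geneA": ([0, 1, 3], [1, 2, 30]),     # from the example above
    "geneB": ([0, 1, 2], [40, 55, 38]),   # hypothetical: high in case only
    "geneC": ([60, 48, 52], [0, 2, 1]),   # hypothetical: high in control only
}

X = 5  # arbitrary threshold, chosen only for this illustration

kept = [gene for gene, (ctrl, case) in counts.items()
        if keep_gene(ctrl, case, X)]
print(kept)  # ['geneB', 'geneC']
```

With this threshold, geneA is dropped because both groups contain a
count below X, while genes that are consistently high in one group and
low in the other are kept.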

However, I understand that this means I will be using the class labels
for my filtering step, which I believe might lead to problems at the
multiple testing correction stage.

Thanks in advance for any help/ideas on this issue.

Jim

-- 
James Perkins
Institute of Structural and Molecular Biology
Division of Biosciences
University College London
Gower Street
London, WC1E 6BT
UK

email: j.perkins at ucl.ac.uk