[BioC] RNA-seq, low count filtering and multiple testing

Wed Dec 12 08:25:40 CET 2012

Dear James,

It does make sense to view a gene as an outlier (or a particular count as 
an outlier), although the data example you give doesn't look all that 
bad, depending on the sequencing depths.

In the future, edgeR will automatically detect outlier genes and 
downweight them in an appropriate way (our in-house version of edgeR, 
not yet public, already does that).

Using the current official release of edgeR, you could simply reduce the 
prior.df when estimating the tagwise dispersions, say:

    y <- estimateGLMTagwiseDisp(y, design, prior.df=5)

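A smaller prior.df squeezes the tagwise dispersions less strongly towards 
the common or trended value, so genes with outlier or very variable counts 
keep larger dispersions and are down-weighted more than is the default.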
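
To see the effect of prior.df on a gene like yours, here is a minimal 
sketch. The counts are simulated (the object names, seed, and simulation 
settings are assumptions for illustration only, not your data), with your 
example gene planted in row 1:

```r
# Sketch only: simulated counts stand in for real data (hypothetical example).
library(edgeR)

set.seed(1)
ngenes <- 1000
group  <- factor(rep(c("control", "case"), each = 3),
                 levels = c("control", "case"))

# Negative binomial counts with no true differential expression
counts <- matrix(rnbinom(ngenes * 6, mu = 20, size = 5), ncol = 6)
rownames(counts) <- paste0("gene", seq_len(ngenes))
colnames(counts) <- c(paste0("control", 1:3), paste0("case", 1:3))

# Plant the example gene from the question in row 1
counts[1, ] <- c(0, 1, 3, 1, 2, 30)

design <- model.matrix(~ group)
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
y <- estimateGLMCommonDisp(y, design)

# Default amount of squeezing versus a reduced prior.df
y_default <- estimateGLMTagwiseDisp(y, design)
y_low     <- estimateGLMTagwiseDisp(y, design, prior.df = 5)

# Compare the tagwise dispersion of the planted gene under both settings
disp <- c(default   = y_default$tagwise.dispersion[1],
          prior.df5 = y_low$tagwise.dispersion[1])
print(disp)
```

Comparing the two values for the planted gene shows how much the choice of 
prior.df matters for a gene with one extreme count.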

If you must filter genes on variability, then it would be better to do so 
based on goodness of fit statistics, rather than on the ad hoc filter you 
propose.  You can explore outliers using the gof (goodness of fit) 
function in edgeR, for example

    fit <- glmFit(y, design)
    gof(fit, plot=TRUE)

will make a Q-Q plot from which outliers can be identified.

The gof() function will also compute p-values and flag outlier genes for 
you.
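
As a rough sketch of how the output of gof() could be used, again on 
simulated counts (the data, seed, and object names are assumptions for 
illustration; pcutoff and adjust are simply set explicitly here):

```r
# Sketch only: simulated counts stand in for real data (hypothetical example).
library(edgeR)

set.seed(1)
group  <- factor(rep(c("control", "case"), each = 3),
                 levels = c("control", "case"))
counts <- matrix(rnbinom(1000 * 6, mu = 20, size = 5), ncol = 6,
                 dimnames = list(paste0("gene", 1:1000),
                                 c(paste0("control", 1:3), paste0("case", 1:3))))
counts[1, ] <- c(0, 1, 3, 1, 2, 30)   # the example gene from the question

design <- model.matrix(~ group)
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design, prior.df = 5)

# Fit the GLMs and compute goodness of fit statistics per gene
fit <- glmFit(y, design)
g   <- gof(fit, pcutoff = 0.1, adjust = "holm", plot = FALSE)

# Per-gene p-values and the logical outlier flag returned by gof()
head(g$gof.pvalues)
flagged <- rownames(counts)[g$outlier]
print(flagged)

# If filtering is really wanted, drop only the flagged genes
y_kept <- y[!g$outlier, ]
```

Because the flag is based on the fit of the model rather than on the 
counts in particular groups, it avoids the ad hoc threshold X.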

Best wishes
Gordon

> Date: Tue, 11 Dec 2012 08:52:20 +0100
> From: James Perkins <j.perkins at ucl.ac.uk>
> To: Bioconductor at r-project.org
> Subject: [BioC] RNA-seq, low count filtering and multiple testing
>
> Hi all,
>
> I understand it is normal to filter out lowly expressed genes before 
> performing differential expression analysis on RNA-seq data (e.g., 
> edgeR, DESeq).
>
> However, with methods such as edgeR, I find a number of genes 
> where there is clearly one outlier that is causing the gene to be deemed 
> significantly DE (though the dispersion value is quite high):
>
> for example
>
>        control1 control2 control3 case1 case2 case3
> geneA         0        1        3     1     2    30
>
> Note that case3 is not an outlier sample, MDS plots show it to be like 
> the other case samples, and the phenotype of the samples is as we would 
> expect. I would say this gene is an outlier rather than the sample being 
> an outlier, if that makes sense.
>
> Would it be fair to filter such examples out? I am thinking of a 
> filtering rule such that:
>
> for each gene, if it has a number of counts below X for at least one 
> case sample AND at least one control sample, discard it.
>
> This way I don't get rid of genes where the expression is high in case 
> and very low (or unexpressed) in control and vice versa.
>
> However, I understand that this means I will be using the class labels 
> for my filtering step, which I believe might lead to problems at the 
> multiple testing correction stage.
>
> Thanks in advance for any help/ideas on this issue.
>
> Jim
>
> -- 
> James Perkins
> Institute of Structural and Molecular Biology
> Division of Biosciences
> University College London
> Gower Street
> London, WC1E 6BT
> UK
>
> email: j.perkins at ucl.ac.uk
>
