[BioC] Filtering out tags with low counts in DESeq and EgdeR?

Wed May 25 05:01:44 CEST 2011

Thanks, Gordon.

I tried setting different common dispersion and prior.n in EdgeR, but the results were so different that I became more confused now. I would be very grateful if you or anyone could give me some idea.

1) Is the following step to filter out low counts gene right? Do I need to change the libsize using  d$counts>=16 or just remain the original libsize?
  f <- calcNormFactors(dat)
  f <- f/exp(mean(log(f)))      
  d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f)
  d <- d[rowSums(d$counts) >=16, ] 

2) How to estimate which common.dispersion or prior.n is better? 
When I used EdgeR, I tried to get some DE genes after adjusting pvalue. To estimate common or tagwise dispertion, I used:
d <- estimateCommonDisp(d) 
d <- estimateTagwiseDisp(d, prior.n = 10)
NO DE gene was found after adjusting pvalue, about 800~1800 DE genes were found with pvalue<0.1 (before adjusting). 

In my data, the biological variation between the four replicates are high,  (the estimateCommonDisp was 5, which means very high variation I think). 
When I used: d$common.dispersion=2,  I got 10209 genes with pvalue<0.1,  and 9532 DE genes with adjusted pvalue<0.1.
When I used: prior.n=0, I got 3428 genes with pvalue<0.1, and 339 DE genes  with adjusted pvalue<0.1. 
But NO DE gene could be found even using prior.n=1.  
Is it OK to use prior.n=0, or to set my own common dispersion? Should I use tagwise dispersion or common dispersion?

Any help or suggestions will be appreciated.

Best,                                       
Xiaohui

-------------------------------------------------------------
Send：Gordon K Smyth
Date：2011-05-25 09:59:36
Receive：Wu, Xiaohui Ms.
Title：Filtering out tags with low counts in DESeq and EgdeR?

Dear Xiaohui,

I'll answer for edgeR, which is what I have experience with.

I personally pre-filter RNA-Seq data in much the same way that I filter 
microarray data.  I aim to keep transcripts that show some evidence of 
being expressed at a worthwhile level in at least one treatment group. 
If I am comparing two groups with sample sizes n1=5 and n2=10, then I will 
keep transcripts that achieve a minimum level of expression in at least 5 
samples (the minimum group size).  Note that this is done without regard 
to the sample labels, so no knowledge of group membership is used in the 
filtering.  The minimum expression level might be, say, 1 count per 
million.

Best wishes
Gordon

> Date: Sat, 21 May 2011 16:25:08 +0800
> From: "Xiaohui Wu" <wux3 at muohio.edu>
> To: "bioconductor" <bioconductor at r-project.org>
> Subject: [BioC] Filtering out tags with low counts in DESeq and EgdeR?
>
> Hello,
>
> I want to find DE genes between two conditions, using DESeq, EdgeR or other tools if any.
> I have some questions on how/when to filter out genes with low counts and I am also confused on the final results.
>
> I have 50,000 genes, 2 conditions, each condition 4 replicates. 25,000 genes have >=8 counts in at least one condition.
> The total counts in each replicate is as following:
> cond1_rep1 cond1_rep2 cond1_rep3 cond1_rep4; cond2_rep1 cond2_rep2 cond2_rep3 cond2_rep4
> 1792310 1896603 1804500 2166961; 1792663 2492385 1781523 2222892
>
> 1. For DESeq
> I used all 50,000 genes to estimate size factor for normalization.
> Then should I remove genes with low count before estimating variance or after that?
> I got totally different results using the following two ways, which I was really surprised that NO overlap was found between the two DE gene sets.
>
> 1) no filtering before estimating variance
> d <- newCountDataSet( dat, group )
> d <- estimateSizeFactors( d)
> sf=sizeFactors(d)
> d <- estimateVarianceFunctions( d )
> res <- nbinomTest( d,"cond1","cond2")
> resSig <- res[ res$padj < .1, ]
>
> 2) filtering before estimating variance
> filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## remove low count genes
> filter.ds <- newCountDataSet( filter, group )
> filter.ds$sizeFactor=sf ## use the sizefactor from 1)
> filter.ds <- estimateVarianceFunctions( filter.ds )
> filter.res <- nbinomTest( filter.ds,"cond1","cond2")
> filter.sig=filter.res[ filter.res$padj < .1, ]
> nrow(filter.sig)
>
> Results: For my data, total 177 DE genes from 2) and 106 from 1), and no overlap between 1) and 2).
>
> 2. For EdgeR
> 1) when should I remove low count genes?
>  f <- calcNormFactors(dat) ## I should calculate factor using the whole data, right?
>  f <- f/exp(mean(log(f)))
>  dat=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## And, then filter out genes with low counts?
>  d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f)
>
> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes?
> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05)
>
> Any help or suggestions will be appreciated.
>
> Thanks,
> Xiaohui

______________________________________________________________________
The information in this email is confidential and intended solely for the addressee.
You must not disclose, forward, print or use it without the permission of the sender.
______________________________________________________________________
.