[BioC] Filtering out tags with low counts in DESeq and EgdeR?

Xiaohui Wu wux3 at muohio.edu
Sat May 21 10:25:08 CEST 2011


Hello,

I want to find DE genes between two conditions, using DESeq, EdgeR or other tools if any. 
I have some questions on how/when to filter out genes with low counts and I am also confused on the final results.

I have 50,000 genes, 2 conditions, each condition 4 replicates. 25,000 genes have >=8 counts in at least one condition.
The total counts in each replicate is as following:
cond1_rep1 cond1_rep2 cond1_rep3 cond1_rep4; cond2_rep1 cond2_rep2 cond2_rep3 cond2_rep4 
1792310 1896603 1804500 2166961; 1792663 2492385 1781523 2222892 

1. For DESeq
I used all 50,000 genes to estimate size factor for normalization. 
Then should I remove genes with low count before estimating variance or after that? 
I got totally different results using the following two ways, which I was really surprised that NO overlap was found between the two DE gene sets.

1) no filtering before estimating variance 
d <- newCountDataSet( dat, group )
d <- estimateSizeFactors( d)
sf=sizeFactors(d)
d <- estimateVarianceFunctions( d )
res <- nbinomTest( d,"cond1","cond2")
resSig <- res[ res$padj < .1, ]

2) filtering before estimating variance
filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## remove low count genes
filter.ds <- newCountDataSet( filter, group )
filter.ds$sizeFactor=sf ## use the sizefactor from 1)
filter.ds <- estimateVarianceFunctions( filter.ds )
filter.res <- nbinomTest( filter.ds,"cond1","cond2")
filter.sig=filter.res[ filter.res$padj < .1, ]
nrow(filter.sig)

Results: For my data, total 177 DE genes from 2) and 106 from 1), and no overlap between 1) and 2).

2. For EdgeR
1) when should I remove low count genes? 
  f <- calcNormFactors(dat) ## I should calculate factor using the whole data, right?
  f <- f/exp(mean(log(f)))
  dat=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## And, then filter out genes with low counts? 
  d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f)

2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes?
To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05)

Any help or suggestions will be appreciated.

Thanks,
Xiaohui



More information about the Bioconductor mailing list