[BioC] Filtering out tags with low counts in DESeq and EgdeR?

Gordon K Smyth smyth at wehi.EDU.AU
Wed May 25 03:59:23 CEST 2011


Dear Xiaohui,

I'll answer for edgeR, which is what I have experience with.

I personally pre-filter RNA-Seq data in much the same way that I filter 
microarray data.  I aim to keep transcripts that show some evidence of 
being expressed at a worthwhile level in at least one treatment group. 
If I am comparing two groups with sample sizes n1=5 and n2=10, then I will 
keep transcripts that achieve a minimum level of expression in at least 5 
samples (the minimum group size).  Note that this is done without regard 
to the sample labels, so no knowledge of group membership is used in the 
filtering.  The minimum expression level might be, say, 1 count per 
million.

Best wishes
Gordon

> Date: Sat, 21 May 2011 16:25:08 +0800
> From: "Xiaohui Wu" <wux3 at muohio.edu>
> To: "bioconductor" <bioconductor at r-project.org>
> Subject: [BioC] Filtering out tags with low counts in DESeq and EgdeR?
>
> Hello,
>
> I want to find DE genes between two conditions, using DESeq, EdgeR or other tools if any.
> I have some questions on how/when to filter out genes with low counts and I am also confused on the final results.
>
> I have 50,000 genes, 2 conditions, each condition 4 replicates. 25,000 genes have >=8 counts in at least one condition.
> The total counts in each replicate is as following:
> cond1_rep1 cond1_rep2 cond1_rep3 cond1_rep4; cond2_rep1 cond2_rep2 cond2_rep3 cond2_rep4
> 1792310 1896603 1804500 2166961; 1792663 2492385 1781523 2222892
>
> 1. For DESeq
> I used all 50,000 genes to estimate size factor for normalization.
> Then should I remove genes with low count before estimating variance or after that?
> I got totally different results using the following two ways, which I was really surprised that NO overlap was found between the two DE gene sets.
>
> 1) no filtering before estimating variance
> d <- newCountDataSet( dat, group )
> d <- estimateSizeFactors( d)
> sf=sizeFactors(d)
> d <- estimateVarianceFunctions( d )
> res <- nbinomTest( d,"cond1","cond2")
> resSig <- res[ res$padj < .1, ]
>
> 2) filtering before estimating variance
> filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## remove low count genes
> filter.ds <- newCountDataSet( filter, group )
> filter.ds$sizeFactor=sf ## use the sizefactor from 1)
> filter.ds <- estimateVarianceFunctions( filter.ds )
> filter.res <- nbinomTest( filter.ds,"cond1","cond2")
> filter.sig=filter.res[ filter.res$padj < .1, ]
> nrow(filter.sig)
>
> Results: For my data, total 177 DE genes from 2) and 106 from 1), and no overlap between 1) and 2).
>
> 2. For EdgeR
> 1) when should I remove low count genes?
>  f <- calcNormFactors(dat) ## I should calculate factor using the whole data, right?
>  f <- f/exp(mean(log(f)))
>  dat=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## And, then filter out genes with low counts?
>  d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f)
>
> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes?
> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05)
>
> Any help or suggestions will be appreciated.
>
> Thanks,
> Xiaohui

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list