[BioC] Filtering out tags with low counts in DESeq and EgdeR?

Gordon K Smyth smyth at wehi.EDU.AU
Wed May 25 09:23:57 CEST 2011


Dear Xiao,

Your dispersion value of 5 is higher than I have even seen for real data, 
This means that the coefficient of variation of expression is sqrt(5)=2.2, 
meaning the standard deviation is more than twice the mean.

With such wild data, it is no surprise that you get no DE genes.  In fact, 
I'd be a bit worried if edgeR did give you DE results.

It is not valid to simply input a smaller dispersion value than your data 
indicates.

Judging from the dispersion value, I think it is likely that you need to 
reconsider the basic quality of your data.  Trying to repair things with 
different statistical methods is not likely to solve the basic problem.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth

On Wed, 25 May 2011, Xiaohui Wu wrote:

> Thanks, Gordon.
>
> I tried setting different common dispersion and prior.n in EdgeR, but the results were so different that I became more confused now. I would be very grateful if you or anyone could give me some idea.
>
> 1) Is the following step to filter out low counts gene right? Do I need to change the libsize using  d$counts>=16 or just remain the original libsize?
>  f <- calcNormFactors(dat)
>  f <- f/exp(mean(log(f)))
>  d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f)
>  d <- d[rowSums(d$counts) >=16, ]
>
> 2) How to estimate which common.dispersion or prior.n is better?
> When I used EdgeR, I tried to get some DE genes after adjusting pvalue. To estimate common or tagwise dispertion, I used:
> d <- estimateCommonDisp(d)
> d <- estimateTagwiseDisp(d, prior.n = 10)
> NO DE gene was found after adjusting pvalue, about 800~1800 DE genes were found with pvalue<0.1 (before adjusting).
>
> In my data, the biological variation between the four replicates are high,  (the estimateCommonDisp was 5, which means very high variation I think).
> When I used: d$common.dispersion=2,  I got 10209 genes with pvalue<0.1,  and 9532 DE genes with adjusted pvalue<0.1.
> When I used: prior.n=0, I got 3428 genes with pvalue<0.1, and 339 DE genes  with adjusted pvalue<0.1.
> But NO DE gene could be found even using prior.n=1.
> Is it OK to use prior.n=0, or to set my own common dispersion? Should I use tagwise dispersion or common dispersion?
>
> Any help or suggestions will be appreciated.
>
> Best,
> Xiaohui
>
>
>
>
> -------------------------------------------------------------
> Send:Gordon K Smyth
> Date:2011-05-25 09:59:36
> Receive:Wu, Xiaohui Ms.
> Title:Filtering out tags with low counts in DESeq and EgdeR?
>
> Dear Xiaohui,
>
> I'll answer for edgeR, which is what I have experience with.
>
> I personally pre-filter RNA-Seq data in much the same way that I filter
> microarray data.  I aim to keep transcripts that show some evidence of
> being expressed at a worthwhile level in at least one treatment group.
> If I am comparing two groups with sample sizes n1=5 and n2=10, then I will
> keep transcripts that achieve a minimum level of expression in at least 5
> samples (the minimum group size).  Note that this is done without regard
> to the sample labels, so no knowledge of group membership is used in the
> filtering.  The minimum expression level might be, say, 1 count per
> million.
>
> Best wishes
> Gordon
>
>> Date: Sat, 21 May 2011 16:25:08 +0800
>> From: "Xiaohui Wu" <wux3 at muohio.edu>
>> To: "bioconductor" <bioconductor at r-project.org>
>> Subject: [BioC] Filtering out tags with low counts in DESeq and EgdeR?
>>
>> Hello,
>>
>> I want to find DE genes between two conditions, using DESeq, EdgeR or other tools if any.
>> I have some questions on how/when to filter out genes with low counts and I am also confused on the final results.
>>
>> I have 50,000 genes, 2 conditions, each condition 4 replicates. 25,000 genes have >=8 counts in at least one condition.
>> The total counts in each replicate is as following:
>> cond1_rep1 cond1_rep2 cond1_rep3 cond1_rep4; cond2_rep1 cond2_rep2 cond2_rep3 cond2_rep4
>> 1792310 1896603 1804500 2166961; 1792663 2492385 1781523 2222892
>>
>> 1. For DESeq
>> I used all 50,000 genes to estimate size factor for normalization.
>> Then should I remove genes with low count before estimating variance or after that?
>> I got totally different results using the following two ways, which I was really surprised that NO overlap was found between the two DE gene sets.
>>
>> 1) no filtering before estimating variance
>> d <- newCountDataSet( dat, group )
>> d <- estimateSizeFactors( d)
>> sf=sizeFactors(d)
>> d <- estimateVarianceFunctions( d )
>> res <- nbinomTest( d,"cond1","cond2")
>> resSig <- res[ res$padj < .1, ]
>>
>> 2) filtering before estimating variance
>> filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## remove low count genes
>> filter.ds <- newCountDataSet( filter, group )
>> filter.ds$sizeFactor=sf ## use the sizefactor from 1)
>> filter.ds <- estimateVarianceFunctions( filter.ds )
>> filter.res <- nbinomTest( filter.ds,"cond1","cond2")
>> filter.sig=filter.res[ filter.res$padj < .1, ]
>> nrow(filter.sig)
>>
>> Results: For my data, total 177 DE genes from 2) and 106 from 1), and no overlap between 1) and 2).
>>
>> 2. For EdgeR
>> 1) when should I remove low count genes?
>>  f <- calcNormFactors(dat) ## I should calculate factor using the whole data, right?
>>  f <- f/exp(mean(log(f)))
>>  dat=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## And, then filter out genes with low counts?
>>  d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f)
>>
>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes?
>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05)
>>
>> Any help or suggestions will be appreciated.
>>
>> Thanks,
>> Xiaohui
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:12}}


More information about the Bioconductor mailing list