[BioC] Filtering out tags with low counts in DESeq and edgeR?

Xiaohui Wu wux3 at muohio.edu
Sun May 22 06:10:29 CEST 2011


Thanks, Simon and Wolfgang.

To follow up on Simon's suggestions:
> > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ]
>Try instead something like:
>filter=dat[rowSums(dat) >= 16, ]
>
>- How does your filter affect the variance functions? Do the plots 
>generated by 'scvPlot()' differ between the filtered and the unfiltered 
>data set?

I tried filter=dat[rowSums(dat) >= 16, ] and filter=dat[rowSums(dat) >= 8, ]; the scvPlots for the cutoffs of 8 and 16 are almost the same, but the plots with and without filtering are different.
The two plots are attached. To me they look different, but I don't know what exactly the differences are; could you look into the plots?
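For anyone who wants to reproduce the comparison, something along these lines should work (a sketch only, assuming the DESeq 1.x functions already used in this thread -- newCountDataSet, estimateSizeFactors, estimateVarianceFunctions, scvPlot -- and the `dat` and `group` objects from my earlier code):

library(DESeq)

# fit the variance functions on the unfiltered counts
cds.all <- newCountDataSet(dat, group)
cds.all <- estimateVarianceFunctions(estimateSizeFactors(cds.all))

# refit after the label-blind row-sum filter
cds.flt <- newCountDataSet(dat[rowSums(dat) >= 16, ], group)
cds.flt <- estimateVarianceFunctions(estimateSizeFactors(cds.flt))

# draw the two SCV plots side by side
par(mfrow = c(1, 2))
scvPlot(cds.all); title("unfiltered")
scvPlot(cds.flt); title("rowSums >= 16")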

>- If so, are the hits that you get at expression strengths where the
>variance functions differ? Are they at the low end, i.e., where the 
>filter made changes?

Sorry, I don't know how to find out where the variance functions change. Can we see this from the scvPlots?

>- Have you tried what happens if you filter after estimating variance? 
>The raw p values should be the same as without filtering, but the 
>adjusted p values might get better.

How do I filter the CountDataSet? Or could I just adjust the p-values (using some adjustment function?) on the unfiltered results of nbinomTest()?

d <- newCountDataSet(dat, group)
d <- estimateSizeFactors(d)
sf <- sizeFactors(d)
d <- estimateVarianceFunctions(d)
... Then, how could I get a CountDataSet with the filtered counts and the variance estimates?
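Would something like the following be correct? (A sketch only, assuming the same DESeq 1.x workflow; "A" and "B" are placeholders for the two levels of `group`, and I assume nbinomTest() returns one row per gene in the same order as `dat`.)

library(DESeq)

d <- newCountDataSet(dat, group)
d <- estimateSizeFactors(d)
d <- estimateVarianceFunctions(d)       # variance functions fitted on ALL genes

res <- nbinomTest(d, "A", "B")          # raw p-values on the unfiltered data

# apply the label-blind filter only now, then re-adjust on the smaller set
keep <- rowSums(dat) >= 16
res.filt <- res[keep, ]
res.filt$padj <- p.adjust(res.filt$pval, method = "BH")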

>> 2. For edgeR
>
>DESeq and edgeR are sufficiently similar that any correct answer 
>regarding filtering should apply to both.
>
>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value; is this possible? Then, can I use the *unadjusted* p.value to get DE genes?
>> To adjust the p-values, I used: nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05)
>
>Of course, this is possible. (Read up on the "multiple hypothesis 
>testing problem" if this is unclear to you.) Not also, though, that you 
>used an FDR of .1 in your DESeq code but of .05 here.

Actually, I tried 0.1 and 0.05, as well as the other adjustment methods provided in edgeR, and I always get 0 DE genes after adjustment.
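To convince myself that this is indeed possible, I ran a small simulation in plain R (no edgeR involved): with 10,000 tests that are all null, a raw cutoff of 0.1 already produces about 1,000 "hits", yet BH adjustment correctly reports none.

set.seed(1)
p <- runif(10000)                         # 10,000 tests, all from the null
sum(p < 0.1)                              # ~1,000 raw "hits" by chance alone
sum(p.adjust(p, method = "BH") < 0.05)    # typically 0 after FDR adjustment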

> On May/21/11 5:16 PM, Wolfgang wrote:
>Hi Martin,
>
>this is a very good question, and it needs more investigation. I'll have 
>a closer look into this and report back in this place. Two comments 
>though already:
>
>- That the dispersion parameter is estimated from the data need not by
>itself be problematic. (It is not problematic in the microarray case
>that the t-test variances are estimated from the data.)
>
>- The situation with limma is different: there, the problem is that 
>limma's Bayesian model, which assumes that gene-level error variances 
>σ^2_1, ..., σ^2_m follow a scaled inverse χ2 distribution, no longer 
>fits the data when the data are filtered for genes with low overall 
>variance. Due to the way that limma is implemented, the posterior 
>degrees-of-freedom estimate of that distribution then becomes infinite, 
>gene-level variance estimates will be ignored (leading to an unintended 
>analysis based on fold change only), and type I error rate control is 
>lost. See Fig. 2B+C in the paper, and the text on p.3.
>
>So, what we need to do with DESeq is
>- simulate data according to the null model
>- see if & how filtering affects the estimated mean-dispersion relationship
>- see if & how it affects the type I error globally and locally (for 
>particular ranges of the mean).
>
>Another point is how much the filtering improves power - that will be 
>related to how many genes can actually be removed by the filtering step, 
>which depends on the data.
>
>	Best wishes
>	Wolfgang
>
>
>
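Wolfgang's limma point can be seen on simulated null data (a sketch only, assuming limma's standard lmFit/eBayes interface; on real data the numbers will differ): after removing the low-variance half of the genes, the estimated prior degrees of freedom explode and the gene-level variance estimates are effectively ignored.

library(limma)

set.seed(1)
y <- matrix(rnorm(10000 * 6), nrow = 10000)   # pure-noise data, 2 groups of 3
design <- model.matrix(~ gl(2, 3))

fit.all <- eBayes(lmFit(y, design))
fit.all$df.prior                              # finite on the unfiltered data

# overall-variance filter: drop the low-variance half of the genes
v <- apply(y, 1, var)
fit.flt <- eBayes(lmFit(y[v >= median(v), ], design))
fit.flt$df.prior                              # much larger, possibly Inf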
>On May/21/11 4:36 PM, Martin Morgan wrote:
>> On 05/21/2011 07:07 AM, Wolfgang Huber wrote:
>>> Hi Xiaohui
>>>
>>> to follow up on the filtering question:
>>>
>>> - the filter that Xiaohui applied is invalid; it will distort the
>>> null-distribution of the test statistic and lead to invalid p-values.
>>> This might explain the discrepancy.
>>>
>>> - the filter that Simon suggested is OK and should provide better
>>> results.
>>>
>>> - I'd also be keen to hear about your experience with this.
>>>
>>> A valid filtering criterion does not change the null distribution of the
>>> subsequently applied test statistic (it can, and in fact should, change
>>> the alternative distribution(s)). In practice, this means choosing a
>>> filter criterion that is statistically independent, under the null, from
>>> the test statistic, and in particular, that it does not use the class
>>> labels. Details in the below-cited PNAS paper.
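Both statements are easy to check on simulated null data (a sketch only; ordinary t-tests on Poisson counts, purely for illustration): the overall-sum filter leaves the null p-values uniform, while a per-group filter like the one in the original post enriches for rows that happen to look different between the groups and skews the p-values toward zero.

set.seed(1)
x <- matrix(rpois(5000 * 6, lambda = 5), nrow = 5000)   # null counts, no true DE
grp <- rep(1:2, each = 3)

# per-row t-tests; rows that are constant within both groups give NA
pv <- apply(x, 1, function(r)
  tryCatch(t.test(r[grp == 1], r[grp == 2])$p.value, error = function(e) NA))

keep.blind <- rowSums(x) >= 30                          # ignores the labels
keep.aware <- rowSums(x[, grp == 1]) >= 20 |            # uses the labels
              rowSums(x[, grp == 2]) >= 20

par(mfrow = c(1, 2))
hist(pv[keep.blind], main = "label-blind filter")       # roughly uniform
hist(pv[keep.aware], main = "label-aware filter")       # skewed toward small p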
>>
>> Was wondering whether, since the dispersion parameter is estimated from
>> the data, in some strict sense the filtering and testing procedures are
>> not independent under the null anyway? For the same reason that one
>> would not want to use a variance filter before a limma-style analysis,
>> if I understand correctly.
>>
>> Martin
>>
>>>
>>> Best wishes
>>> Wolfgang
>>>
>>>
>>>
>>>
>>>
>>> On May/21/11 11:02 AM, Simon Anders wrote:
>>>> Hi Xiaohui
>>>>
>>>> I agree that it is worrying to get such different results from your two
>>>> approaches of using DESeq. Here are a few suggestions for how you might
>>>> investigate this (and I'd be eager to hear about your findings):
>>>>
>>>> - Bourgon et al. (PNAS, 2010, 107:9546) have studied how pre-filtering
>>>> affects the validity and power of a test. They stress that it is
>>>> important that the filter is blind to the sample labels (actually: even
>>>> permutation invariant). So what you do here is not statistically sound:
>>>>
>>>> > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ]
>>>>
>>>> Try instead something like:
>>>>
>>>> filter=dat[rowSums(dat) >= 16, ]
>>>>
>>>> - How does your filter affect the variance functions? Do the plots
>>>> generated by 'scvPlot()' differ between the filtered and the unfiltered
>>>> data set?
>>>>
>>>> - If so, are the hits that you get at expression strengths where the
>>>> variance functions differ? Are they at the low end, i.e., where the
>>>> filter made changes?
>>>>
>>>> - Have you tried what happens if you filter after estimating variance?
>>>> The raw p values should be the same as without filtering, but the
>>>> adjusted p values might get better.
>>>>
>>>> To be honest, I'm currently a bit at a loss as to which one is more correct:
>>>> Filtering before or after variance estimation. Let's hear what other
>>>> people on the list think.
>>>>
>>>>> 2. For edgeR
>>>>
>>>> DESeq and edgeR are sufficiently similar that any correct answer
>>>> regarding filtering should apply to both.
>>>>
>>>>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after
>>>>> adjusting p.value; is this possible? Then, can I use the *unadjusted*
>>>>> p.value to get DE genes?
>>>>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method =
>>>>> "BH")< 0.05)
>>>>
>>>> Of course, this is possible. (Read up on the "multiple hypothesis
>>>> testing problem" if this is unclear to you.) Not also, though, that you
>>>> used an FDR of .1 in your DESeq code but of .05 here.
>>>>
>>>> Simon
>>>>

