[BioC] Filtering out tags with low counts in DESeq and edgeR?

Sat May 21 16:07:33 CEST 2011

Hi Xiaohui

to follow up on the filtering question:

- the filter that Xiaohui applied is invalid: it will distort the 
null distribution of the test statistic and lead to invalid p-values. 
This might explain the discrepancy.

- the filter that Simon suggested is OK and should provide better results.

- I'd also be keen to hear about your experience with this.
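
As a minimal sketch of the label-blind filter mentioned above, the following uses a simulated toy count matrix; the data, the object names `dat`, `keep` and `filtered`, and the negative binomial parameters are illustrative assumptions, not taken from Xiaohui's data. It only shows that a filter on the total count per tag is unaffected by permuting the sample labels.

```r
set.seed(42)

# Toy count matrix: 1000 tags x 6 samples (simulated, purely illustrative).
dat <- matrix(rnbinom(1000 * 6, mu = 3, size = 2), nrow = 1000,
              dimnames = list(paste0("tag", 1:1000), paste0("s", 1:6)))

# Label-blind filter: uses only the total count per tag,
# never which sample belongs to which group.
keep     <- rowSums(dat) >= 16
filtered <- dat[keep, , drop = FALSE]

# Permuting the columns (i.e. the sample labels) leaves the filter unchanged.
perm <- sample(ncol(dat))
stopifnot(identical(keep, rowSums(dat[, perm]) >= 16))

cat("tags kept:", nrow(filtered), "of", nrow(dat), "\n")
```

The threshold of 16 is the one from Simon's suggestion; a sensible value for real data will depend on library sizes and the number of samples.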

A valid filtering criterion does not change the null distribution of the 
subsequently applied test statistic (it can, and in fact should, change 
the alternative distribution(s)). In practice, this means choosing a 
filter criterion that is statistically independent, under the null, of 
the test statistic and that, in particular, does not use the class 
labels. Details are in the PNAS paper cited below.
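
The effect on the null distribution can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: counts are Poisson rather than negative binomial, the test is a plain conditional binomial test standing in for the DESeq test, and the label-aware filter (`keep_aware`, with its threshold of 12) is a hypothetical example of a filter that looks at the groups, not a reconstruction of the filter used in the original post.

```r
set.seed(1)
n_genes <- 50000
group1 <- 1:3   # column indices of condition 1 (hypothetical layout)
group2 <- 4:6   # column indices of condition 2

# Null data: each gene has the same Poisson mean in all six samples.
mu  <- rexp(n_genes, rate = 1)
dat <- matrix(rpois(n_genes * 6, lambda = rep(mu, times = 6)), nrow = n_genes)

s1 <- rowSums(dat[, group1])
s2 <- rowSums(dat[, group2])
n  <- s1 + s2

# Stand-in test: given the total n, s1 ~ Binomial(n, 1/2) under the null.
# Two-sided p-value by doubling the smaller tail (conservative).
p <- pmin(1, 2 * pmin(pbinom(s1, n, 0.5),
                      pbinom(s1 - 1, n, 0.5, lower.tail = FALSE)))

keep_blind <- n >= 16              # uses only the total count
keep_aware <- pmax(s1, s2) >= 12   # hypothetical filter that uses the groups

# Fraction of null genes called at the 0.05 level after each filter.
frac_blind <- mean(p[keep_blind] <= 0.05)
frac_aware <- mean(p[keep_aware] <= 0.05)
cat("label-blind filter:", frac_blind, "\n")
cat("label-aware filter:", frac_aware, "\n")
```

Since every gene here is null, the fraction after the label-blind filter should stay at or below 0.05, while the label-aware filter preferentially keeps genes whose two groups happen to differ and should therefore give a larger fraction. The exact numbers depend on the seed and on the assumed distribution of means.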

	Best wishes
	Wolfgang

On May/21/11 11:02 AM, Simon Anders wrote:
> Hi Xiaohui
>
> I agree that it is worrying to get such different results from your two
> approaches of using DESeq. Here are a few suggestions for how you might
> investigate this (and I'd be eager to hear about your findings):
>
> - Bourgon et al. (PNAS, 2010, 107:9546) have studied how pre-filtering
> affects the validity and power of a test. They stress that it is
> important that the filter is blind to the sample labels (actually: even
> permutation invariant). So what you do here is not statistically sound:
>
>  > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ]
>
> Try instead something like:
>
> filter=dat[rowSums(dat) >= 16, ]
>
> - How does your filter affect the variance functions? Do the plots
> generated by 'scvPlot()' differ between the filtered and the unfiltered
> data set?
>
> - If so, are the hits that you get at expression strengths where the
> variance functions differ? Are they at the low end, i.e., where the
> filter made changes?
>
> - Have you tried what happens if you filter after estimating variance?
> The raw p values should be the same as without filtering, but the
> adjusted p values might get better.
>
> To be honest, I'm currently a bit at a loss which one is more correct:
> Filtering before or after variance estimation. Let's hear what other
> people on the list think.
>
>> 2. For edgeR
>
> DESeq and edgeR are sufficiently similar that any correct answer
> regarding filtering should apply to both.
>
>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after
>> adjusting p.value, is this possible? Then, can I use the *unadjusted*
>> p.value to get DE genes?
>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method =
>> "BH")< 0.05)
>
> Of course, this is possible. (Read up on the "multiple hypothesis
> testing problem" if this is unclear to you.) Note also, though, that you
> used an FDR of .1 in your DESeq code but of .05 here.
>
> Simon

-- 

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber