[BioC] Filtering out tags with low counts in DESeq and edgeR?

Sun May 22 10:59:23 CEST 2011

Dear Fernando

(Un)balance of group sizes does not play a role. What is important is 
that the test statistic for differential expression is statistically 
independent from the filter criterion *under the null hypothesis* of no 
differential expression.

To a very good approximation, this is the case for the combination of
- the row sums of the count matrix (as the filter criterion)
- the negative binomial test that DESeq and edgeR perform (as the test)

Thus, filtering is also OK for experiments performed with unbalanced 
numbers of samples.
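
As a rough illustration of the independence argument (a minimal sketch, not the DESeq or edgeR procedure): the simulation below assumes Poisson rather than negative binomial counts, and uses an exact binomial test conditional on the row sum as a stand-in for the negative binomial test. The group sizes (10 and 5), the threshold of 16, and all variable names are made up for illustration. Because this test conditions on the row sum, filtering on that sum should leave the share of null p-values below 0.05 at or below about 5%, even with unbalanced groups.

```r
# Simulate counts under the null: every gene has the same per-sample mean
# in both groups, but the groups have unequal sizes (10 vs 5 samples).
set.seed(1)
n_genes <- 4000
n_a <- 10
n_b <- 5
n_samples <- n_a + n_b

# Hypothetical per-gene expression levels (mean 2 reads per sample).
mu <- rexp(n_genes, rate = 0.5)
counts <- matrix(rpois(n_genes * n_samples, lambda = rep(mu, times = n_samples)),
                 nrow = n_genes, ncol = n_samples)
group <- rep(c("A", "B"), times = c(n_a, n_b))

# Label-blind filter: uses only the row sums, never the group labels.
total <- rowSums(counts)
keep <- total >= 16

# Stand-in test: given the row sum, the group A sum is binomial with
# success probability n_a / n_samples under the null.
sum_a <- rowSums(counts[, group == "A"])
p_null <- n_a / n_samples
pvals <- vapply(which(keep), function(i) {
  binom.test(sum_a[i], total[i], p = p_null)$p.value
}, numeric(1))

# Share of retained null genes called at the 5% level.
fpr <- mean(pvals < 0.05)
cat("genes kept:", sum(keep), " false positive rate:", fpr, "\n")
```

Running the same check with and without the `keep` filter is one way to examine a candidate filter criterion before applying it to real data.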

	Wolfgang

On May/21/11 11:09 PM, Biase, Fernando wrote:
> Hi,
>
> I understand that filtering the dataset based on all the samples is more appropriate than filtering per experimental group. However, if one has unbalanced samples, is it still valid?
>
> Assume one group (A) has 10 samples and the other group (B) has 5 samples sequenced. If I filter by the total number of reads across the 15 samples, I would eliminate many more genes expressed at the lower range in group B than genes expressed at the lower range in group A.
>
> Is there a way around this for experiments performed with unbalanced numbers of samples?
>
> Best regards,
>
> Fernando
>
> ________________________________________
> From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] On Behalf Of Wolfgang Huber [whuber at embl.de]
> Sent: Saturday, May 21, 2011 9:07 AM
> To: bioconductor at r-project.org
> Subject: Re: [BioC] Filtering out tags with low counts in DESeq and edgeR?
>
> Hi Xiaohui
>
> to follow up on the filtering question:
>
> - the filter that Xiaohui applied is invalid: it will distort the
> null-distribution of the test statistic and lead to invalid p-values.
> This might explain the discrepancy.
>
> - the filter that Simon suggested is OK and should provide better results.
>
> - I'd also be keen to hear about your experience with this.
>
> A valid filtering criterion does not change the null distribution of the
> subsequently applied test statistic (it can, and in fact should, change
> the alternative distribution(s)). In practice, this means choosing a
> filter criterion that is statistically independent, under the null, from
> the test statistic, and in particular, that it does not use the class
> labels. Details in the below-cited PNAS paper.
>
>          Best wishes
>          Wolfgang
>
>
>
>
>
> On May/21/11 11:02 AM, Simon Anders wrote:
>> Hi Xiaohui
>>
>> I agree that it is worrying to get such different results from your two
>> approaches of using DESeq. Here are a few suggestions for how you might
>> investigate this (and I'd be eager to hear about your findings):
>>
>> - Bourgon et al. (PNAS, 2010, 107:9546) have studied how pre-filtering
>> affects the validity and power of a test. They stress that it is
>> important that the filter is blind to the sample labels (actually: even
>> permutation invariant). So what you do here is not statistically sound:
>>
>>   >  filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ]
>>
>> Try instead something like:
>>
>> filter=dat[rowSums(dat)>= 16, ]
>>
>> - How does your filter affect the variance functions? Do the plots
>> generated by 'scvPlot()' differ between the filtered and the unfiltered
>> data set?
>>
>> - If so, are the hits that you get at expression strengths where the
>> variance functions differ? Are they at the low end, i.e., where the
>> filter made changes?
>>
>> - Have you tried what happens if you filter after estimating variance?
>> The raw p values should be the same as without filtering, but the
>> adjusted p values might get better.
>>
>> To be honest, I'm currently a bit at a loss as to which is more correct:
>> filtering before or after variance estimation. Let's hear what other
>> people on the list think.
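
The point about adjusted p-values can be sketched with made-up numbers (hypothetical p-values, not output from DESeq or edgeR). Raw p-values do not change when rows are dropped after testing, but the Benjamini-Hochberg adjustment scales with the number of tests, so removing low-count genes that have little chance of being called can lower the adjusted values of the remaining genes. The sketch assumes the filter is independent of the test statistic under the null, as required for a valid filter.

```r
# Hypothetical raw p-values: 5 genes with small p-values, followed by
# 95 low-count genes whose p-values are all 1.
p_raw <- c(0.001, 0.002, 0.003, 0.004, 0.005, rep(1, 95))

# Hypothetical label-blind filter that happens to drop 80 low-count genes.
keep <- c(rep(TRUE, 20), rep(FALSE, 80))

# Benjamini-Hochberg adjustment over all tests vs. over the retained tests.
adj_all <- p.adjust(p_raw, method = "BH")
adj_filtered <- p.adjust(p_raw[keep], method = "BH")

# Number of genes called at an FDR of 5% in each case.
hits_all <- sum(adj_all < 0.05)
hits_filtered <- sum(adj_filtered < 0.05)
cat("hits without filtering:", hits_all, "\n")
cat("hits after filtering:  ", hits_filtered, "\n")
```

In this toy case no gene passes the 5% cut-off when all 100 tests are adjusted together, while all five genes with small p-values pass once the adjustment runs over the 20 retained tests.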
>>
>>> 2. For EdgeR
>>
>> DESeq and edgeR are sufficiently similar that any correct answer
>> regarding filtering should apply to both.
>>
>>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after
>>> adjusting p.value, is this possible? Then, can I use the *unadjusted*
>>> p.value to get DE genes?
>>> To adjust pvalue, I used:
>>> nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05)
>>
>> Of course, this is possible. (Read up on the "multiple hypothesis
>> testing problem" if this is unclear to you.) Note also, though, that you
>> used an FDR of .1 in your DESeq code but of .05 here.
>>
>> Simon
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
>
>
> Wolfgang Huber
> EMBL
> http://www.embl.de/research/units/genome_biology/huber
>

-- 

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber