[BioC] Filtering out tags with low counts in DESeq and edgeR?

Biase, Fernando biase at illinois.edu
Sat May 21 23:09:16 CEST 2011


Hi, 

I understand that filtering the dataset based on all the samples is more appropriate than filtering per experimental group. However, is this still valid if the groups have unequal numbers of samples?

Assume one group (A) has 10 sequenced samples and the other group (B) has 5. If I filter by the total number of reads across all 15 samples, I would eliminate many more of the genes expressed at the low end in group B than of the genes expressed at the low end in group A.

Is there a way around this for experiments with unbalanced numbers of samples?

Best regards,

Fernando

________________________________________
From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] On Behalf Of Wolfgang Huber [whuber at embl.de]
Sent: Saturday, May 21, 2011 9:07 AM
To: bioconductor at r-project.org
Subject: Re: [BioC] Filtering out tags with low counts in DESeq and edgeR?

Hi Xiaohui

to follow up on the filtering question:

- the filter that Xiaohui applied is invalid: it will distort the
null distribution of the test statistic and lead to invalid p-values.
This might explain the discrepancy.

- the filter that Simon suggested is OK and should provide better results.

- I'd also be keen to hear about your experience with this.

A valid filtering criterion does not change the null distribution of the
subsequently applied test statistic (it can, and in fact should, change
the alternative distribution(s)). In practice, this means choosing a
filter criterion that is statistically independent, under the null, of
the test statistic and that, in particular, does not use the class
labels. Details are in the PNAS paper cited below.
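
The point about the null distribution can be checked with a small simulation. The following is only a sketch under stated assumptions, not the analysis from this thread: the counts are simulated Poisson data with no true differences, the groups have three samples each, the test is an exact conditional binomial test rather than the DESeq or edgeR test, and the label-aware filter is a simplified group-sum variant of the one discussed here. All object names are made up for illustration.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
n_genes <- 50000
group1  <- 1:3                     # columns of the first condition
group2  <- 4:6                     # columns of the second condition

# Null data: each gene has the same Poisson rate in all six samples.
mu  <- runif(n_genes, min = 0.5, max = 3)
dat <- matrix(rpois(n_genes * 6, lambda = mu), nrow = n_genes)

a     <- rowSums(dat[, group1])
b     <- rowSums(dat[, group2])
total <- a + b

# Two-sided conditional test: given the total, a ~ Binomial(total, 1/2)
# under the null, so the p-value is twice the smaller tail, capped at 1.
pval <- pmin(1, 2 * pbinom(pmin(a, b), total, 0.5))

keep_blind <- total >= 16          # ignores the class labels
keep_aware <- a >= 8 | b >= 8      # uses the class labels

# Fraction of null genes called significant at 0.05 after each filter.
frac_blind <- mean(pval[keep_blind] < 0.05)
frac_aware <- mean(pval[keep_aware] < 0.05)
c(blind = frac_blind, aware = frac_aware)
```

Because the total count is exactly what the conditional test conditions on, filtering on it leaves the null p-values valid. The label-aware filter instead keeps a low-count gene only when its reads are split unevenly between the groups, so the surviving null genes are enriched for small p-values.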

        Best wishes
        Wolfgang





On May/21/11 11:02 AM, Simon Anders wrote:
> Hi Xiaohui
>
> I agree that it is worrying to get such different results from your two
> approaches of using DESeq. Here are a few suggestions for how you might
> investigate this (and I'd be eager to hear about your findings):
>
> - Bourgon et al. (PNAS, 2010, 107:9546) have studied how pre-filtering
> affects the validity and power of a test. They stress that it is
> important that the filter is blind to the sample labels (actually: even
> permutation invariant). So what you do here is not statistically sound:
>
>  > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ]
>
> Try instead something like:
>
> filter=dat[rowSums(dat) >= 16, ]
>
> - How does your filter affect the variance functions? Do the plots
> generated by 'scvPlot()' differ between the filtered and the unfiltered
> data set?
>
> - If so, are the hits that you get at expression strengths where the
> variance functions differ? Are they at the low end, i.e., where the
> filter made changes?
>
> - Have you tried what happens if you filter after estimating variance?
> The raw p values should be the same as without filtering, but the
> adjusted p values might get better.
>
> To be honest, I'm currently a bit at a loss which one is more correct:
> Filtering before or after variance estimation. Let's hear what other
> people on the list think.
>
>> 2. For EdgeR
>
> DESeq and edgeR are sufficiently similar that any correct answer
> regarding filtering should apply to both.
>
>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after
>> adjusting p.value, is this possible? Then, can I use the *unadjusted*
>> p.value to get DE genes?
>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method =
>> "BH")< 0.05)
>
> Of course, this is possible. (Read up on the "multiple hypothesis
> testing problem" if this is unclear to you.) Note also, though, that you
> used an FDR of .1 in your DESeq code but of .05 here.
>
> Simon
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
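
Simon's remark that filtering after the test leaves the raw p-values unchanged but may improve the adjusted ones can be illustrated with a toy example. This is a sketch only: the p-values below are made up, the `keep` vector stands in for a hypothetical label-blind count filter, and the comparison is only meaningful if that filter is independent of the test statistic under the null, as required above.

```r
# Made-up p-values for illustration only: 4 genes that pass a hypothetical
# label-blind count filter and 96 low-count genes that do not.
p_kept    <- c(0.001, 0.004, 0.008, 0.03)
p_dropped <- seq(0.2, 1, length.out = 96)
p_all     <- c(p_kept, p_dropped)
keep      <- rep(c(TRUE, FALSE), times = c(4, 96))

# Filtering after testing leaves the raw p-values of the kept genes untouched.
same_raw <- identical(p_all[keep], p_kept)

# The Benjamini-Hochberg correction then accounts for 4 tests instead of 100.
padj_all      <- p.adjust(p_all, method = "BH")
padj_filtered <- p.adjust(p_all[keep], method = "BH")

hits_all      <- sum(padj_all < 0.05)       # hits without filtering
hits_filtered <- sum(padj_filtered < 0.05)  # hits among the kept genes
```

In this toy case no gene passes the 0.05 threshold when all 100 tests are adjusted together, while all four kept genes pass once the adjustment is restricted to them. The same mechanism explains how many genes can have a small unadjusted p-value and none survive the adjustment.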


--


Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
