[BioC] Why edgeR ouput some p values larger than 1

Tue Jan 10 06:06:04 CET 2012

Hi Yuan,

OK, I see.  Thanks for alerting me to this and sending through your 
example data.

The short answer is that all the p-values that are greater than 1 should 
be equal to 1.  I have committed through a fix to the edgeR package today 
to make sure that this happens.

The problems arises because the default exact test method is to double the 
smaller of the two tail probabilities.  If both of the tail probabilities 
are greater than 0.5, then the final p-value ends up being > 1. The issue 
arises most commonly when the counts are so small that significant 
differential expression is impossible anyway.

For example, suppose there are two libraries in group A and one library in 
group B, and the counts for a particular gene are: 0 0 1 for the three 
libraries.  Suppose the overall library sizes are all equal.  The total 
count in group A is 0 and the total count in B in 1.  Given the fact that 
there is one count in total, the probability that it occurs in group A is 
2/3 and the probability it occurs in group B is 1/3, under the null 
hypothesis.  We actually observe the A count to be zero.

   Prob(A count <= 0) = 2/3
   Prob(A count >= 0) = 1

Hence both tail probabilities are > 1/2.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.wehi.edu.au
http://www.statsci.org/smyth

On Sun, 8 Jan 2012, Yuan Tian wrote:

> Dear Gordon,
>
> In my dataset, there are 1509 genes that got a p.value larger than1. I 
> extract out the read counts of those genes and make them into a new 
> dataset named as "dat" (also attached as "test.csv"). I ran the 
> following codes, and I still got 1034 genes with pvalue larger than 1...
>
>> conds
> [1] 1 2 1 1 1 1 1 2 2 1 1
>> d <- DGEList(counts=dat, group=conds)
>> d <- calcNormFactors(d, method="TMM")
>> d <- estimateCommonDisp(d)
>> d<-estimateTagwiseDisp(d)
>> de=exactTest(d)
> Comparison of groups:  2 - 1
>
>> summary(de$table$p.value)
>     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
> 0.0000001 0.9372000 1.0000000 0.9548000 1.1750000 1.4550000
>
> Most of problematic genes are actually the genes that have 0 read count 
> in some samples, but would this be a problem for the DE test?
>
> Best,
>
> Yuan

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}