[R] Depletion of small p values upon iterative testing of identical normal distributions

Mon Sep 20 17:14:30 CEST 2010

On Mon, 20 Sep 2010, Duncan Murdoch wrote:

> On 20/09/2010 9:54 AM, A wrote:
>> Dear all,
>> 
>> I'm performing a t-test on two normal distributions with identical mean&
>> standard deviation, and repeating this tests a very large number of times 
>> to
>> describe an representative p value distribution in a null case. As a part 
>> of
>> this, the program bins these values in 10 evenly distributed bins between 0
>> and 1 and reports the number of observations in each bin. What I have
>> noticed is that even after 500,000 replications the number in my lowest bin
>> is consistently ~5% smaller than the number in all the other bins, which 
>> are
>> similar within about 1% of each other. Is there any reason, perhaps to do
>> with random number generation in R or the nature of the normal distribution
>> simulated by the rnorm function that could explain this depletion?
>
> No, equal sized bins should expect equal numbers of entries.  But your code 
> may have errors in it.
>

It's actually more interesting than that.

Here's the code in my preferred format for small simulations
one.test <- function (m = -0.065, s = 0.0837) 
{
     x <- rnorm(6, m, s)
     y <- rnorm(6, m, s)
     t.test(x, y)$p.value
}

pvals <- replicate(1e5, one.test())

Now, binning could be done wrong, and in any case isn't a good way to compare distributions, so we try
    qqplot(pvals, ppoints(1e5))
and, since the issue is with small p-values
    qqplot(log10(pvals), log10(ppoints(1e5)))

and we see that there are in fact fewer small p-values than there should be, starting at about 10^(-2.5).

After thinking about this for a while, we realize that the t.test() function shouldn't be expected to give exactly uniform p-values -- it's not as if it is doing an exact test, after all.

We smugly repeat the exercise with 
one.ev.test <- function (m = -0.065, s = 0.0837) 
{
     x <- rnorm(6, m, s)
     y <- rnorm(6, m, s)
     t.test(x, y, var.equal=TRUE)$p.value
}
ev.pvals <- replicate(1e5, one.test())

and the problem is solved.

        -thomas

Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle