[R] Random numbers negatively correlated?

Thomas Lumley tlumley at u.washington.edu
Tue Jun 27 20:07:52 CEST 2006


On Tue, 27 Jun 2006, Christian Hennig wrote:

> Dear list,
>
> I did simulations in which I generated 10000
> independent Bernoulli(0.5)-sequences of length 100. I estimated
> p for each sequence and I also estimated the conditional probability that
> a one is followed by another one (which should be p as well).
> However, the second probability is significantly smaller than 0.5 (namely
> about 0.494, see below) and of course smaller than the direct estimates of
> p as well, indicating negative correlation between the random numbers.
>
> See below the code and the results.
> Did I do something wrong or are the numbers in
> fact negatively correlated? (A type I error is quite unlikely with a
> p-value below 2.2e-16.)

I think you did something wrong, and that there is a problem with 
overlapping blocks of two.

If you do
   x<-matrix(rbinom(1e6,1,p),ncol=2)
   tt<-table(x[,1],x[,2])

you get much better looking results. In this case you have 500,000 
independent pairs of numbers that can be 01, 10, 11, 00. A test for 
independence seems fine.

> tt

          0      1
   0 125246 124814
   1 125140 124800
> fisher.test(tt)

         Fisher's Exact Test for Count Data

data:  tt
p-value = 0.8987
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  0.9896745 1.0119211
sample estimates:
odds ratio
   1.000735


In your case the deficit in est11 is suspiciously close to 0.5/n. Changing 
n to 1000 and using the same seed I get
> mean(est11)-0.5
[1] -0.0005534743
10 times smaller, and still close to 0.5/n.

Now, consider what happens in a case where we can see all the 
possibilities, n=3

x    1/0  1/1 
000   -   -
001   -   -
010   1   0
011   0   1
100   1   0
101   1   0
110   1   1
111   2   0

So that if each of these three triplets has the same probability your 
est11 would be 2/8 rather than 4/8, and est11 is not an unbiased estimate 
of the long-run conditional probability. The bias is of order 1/n, so you 
need n to be of larger order than  sqrt(simruns).


 	-thomas



More information about the R-help mailing list