[R] Generating a binomial random variable correlated with a

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Sun Apr 17 12:52:49 CEST 2005


On 16-Apr-05 Ashraf Chaudhary wrote:
> Ted:
> Thank you for your help. All I want is a binomial random
> variable that is correlated with a normal random variable
> with specified correlation. By linear I mean the ordinary
> Pearson correlation. I tried the following two methods,
> in each case the resulting correlation is substantially
> less than the one specified.   
> 
> Method I: Generate two correlated normals (using Cholesky
> decomposition method) and dichotomize one (0/1) to get the
> binomial. Method II: Generate two correlated variables, one
> binomial and one normal using the Cholesky decomposition methods.
> 
> Here is how I did:
> 
> X <- rnorm(100)          
> Y <- rnorm(100)           
> r<- 0.7
> Y1 <- X*r+Y*sqrt(1-r**2)     
> cor(X,Y1)         # Correlated normals using Cholesky decomposition
> cor(X>0.84,Y1)  # Method I
> 
>##
> X1 <- rbinom(100,1,0.5)    
> Y2 <- X1*r+Y*sqrt(1-r**2)  
> cor(X1,Y2);     # Method II

Hello Ashraf,

The above more explicit explanation certainly helps to clarify
your question! (I echo Bill's counsel to spell out essential
detail).

I'll lead on from your "second method" above, which makes it
explicit that, to each of your 100 values (Y) sampled from rnorm
you are sampling from {0,1} to get one of your 100 binomial
(n=1) variables (X1). However, in this second method you also
create the derived variable Y2 = X1*r+Y*sqrt(1-r**2), and
apparently you want this (Y2) to be the Normal variate which
is correlated with X1.

Unfortunately, given the way Y2 is constructed, it does not
have a Normal distribution. This is almost obvious from the
fact that, conditional on the number of 1s in X1, the values
of Y2 are a mixture of N(0,1-r^2) and N(r,1-r^2).

You can explore this in R by looking at the histogram of a
much larger sample of Y2. Thus:

  r<- 0.7
  Y <- rnorm(100000)
  X1 <- rbinom(100000,1,0.5)
  Y2 <- X1*r+Y*sqrt(1-r**2)
  m<-mean(Y2) ; s<-sd(Y2)
  hist(Y2,breaks=0.1*(-45:45))
  lines(0.1*(-45:45),0.1*100000*dnorm(0.1*(-45:45),m,s))

Do it once, and the fit looks pretty good. However, if you
repeat the above commands several times, you will observe
that, near the peak of the curve, there is a definite
tendency for the peak of the histogram to lie below the peak
of the fitted curve. This illustrates the fact that you are
looking at a superposition of two Normal curves, with identical
(since you have p=0.5 in the Binomial sample) proportions
and variances, but slightly shifted relative to each other
(by an amount r which is in magnitude less than 1). This
superposition is flatter at the top than any single Normal
curve would be, which is being reflected in the "deficiency"
of the histogram relative to the fitted curve

You can also see this theoretically, since your Y2 is

  Y2 = A*X1 + B*Y

so that

  P[Y2 <= y] = p*P[B*Y <= y] + (1-p)*P[B*Y <= y - A]

             = p*P[Y <= y/B] + (1-p)*P[Y <= (y-A)/B]

where p is the Binomial P[X1 = 1].

Hence the density function of Y2 is the derivative of this
w.r.to y, namely

  p*f(y/B)/B + (1-p)*f((y-A)/B)/B

where f(x) is the density function of the standard N(0,1).

While in the case of your example (see histograms) the
distribution of Y2 might be close enough to Normal for
practical purposes (but this depends on your practical
purpose), it is nonetheless better to try to avoid the
theoretical difficulty if possible.

So I'd suggest experimenting on the following lines.

1. Let X1 be a sample of size N using rbinom(N,1,p)
   (where, in general, p need not be 0.5)

2. Let Y be a sample of size N using rnorm(N,mu,sigma)
   (and again, in general, mu need not be 0 nor sigma 1).

This is as in your example. So far, you have a true
Binomial variate X1 and a true Normal variate Y.

3. X1 will have r 1s in it, and (N-r) 0s. Now consider
   setting up a correspondence between the 1s and 0s in
   X1 and the values in Y, in such a way that the 1s
   tend to get allocated to the higher values of Y and
   the 0s to lower values of Y. The resulting set of
   pairs (X1,Y) is then your correlated sample.

   In other words, permute the sample X1 and cbind the
   result to Y.

This is similar to your "dichotomy" method (and would
be almost identical if you simply allocated the r 1s
to the r largest Ys), but is more flexible since I'm
only saying "tend". In other words, consider sampling
from the N Binomial 0s and 1s according to a non-uniform
probability distribution.

The theoretical benefit from doing it this way (provided
you can arrange the sampling in (3) so as to get your
desired correlation) is that, by construction, the marginal
distributions of X1 and Y are exactly as they were to start
with, namely Binomial and Normal respectively.

Here is an example of a possible approach:

  X0 <- rbinom(100,1,0.5)
  Y <- rnorm(100) ; Y<-sort(Y)
  p0<-((1:100)-0.5)/100 ; p0<-p0/sum(p0); p<-cumsum(p0)
  r<-sum(X0); ix<-sample((1:100),r,replace=FALSE,prob=p)
  X1<-numeric(100); X1[ix]<-1;sum(X1)
  ## [1] 50
  cor(X1,Y)
  ## [1] 0.6150937

showing that it's quite easy to get close to your (example)
correlation of 0..7 by simple means. (Note that X1, as
constructed above, is equivalent to the original Binomial
sample X0, since it has the same number of 0s and 1s; it's
just a matter of rearanging these so as to tend to line
up with the higher values of Y).

To refine the method (on this approach) play with the way
that p is derived from p0 (the second line above). You might
look at some of the suggestions in Bill Venables' (private)
mail.

Even something as banal as

  X1 <- sort(rbinom(100,1,0.5))
  Y <- sort(rnorm(100))
  cor(X1,Y)
  ## [1] 0.768373

gives an interesting result, though this is far too inflexible
for general purposes.

There may even be a theoretical way of arranging the 1s
so as to get the desired correlation exactly on average
-- my nose indicates that this may be so, but I have not
followed it!

Best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 17-Apr-05                                       Time: 11:52:49
------------------------------ XFMail ------------------------------




More information about the R-help mailing list