[R] Generating uniformly distributed correlated data.

(Ted Harding) ted.harding at wlandres.net
Mon Feb 21 22:38:11 CET 2011


[See at end]

On 21-Feb-11 17:13:03, Larry Hotchkiss wrote:
>   Hi list:
> Here is one approach that will generate two uniformly distributed 
> variables with a correlation of  0.5 --
> [snip]
>> >  -----Urspr?ngliche Nachricht-----
>> >  Von:r-help-bounces at r-project.org
>> >  [mailto:r-help-bounces at r-project.org] Im Auftrag von S??ren Faurby
>> >  Gesendet: Sonntag, 20. Februar 2011 03:18
>> >  An:r-help at r-project.org
>> >  Betreff: [R] Generating uniformly distributed correlated data.
>> >  
>> >  I wish to generate a vector of uniformly distributed data
>> >  with a defined correlation to another vector
>> >  
>> >  The only function I have been able to find doing something
>> >  similar is corgen from the library ecodist.
>> >  
>> >  The following code generates data with the desired
>> >  correlation to the vector x but the resulting vector y is
>> >  normal and not uniform distributed
>> >  
>> >  library(ecodist)
>> >  x<- runif(105)
>> >  y<- corgen(x=x, r=.5)$y
>> >  
>> >  Do anyone know a similar function generating uniform
>> >  distributed data or a way of transforming y to the desired
>> >  distribution while keeping the correlation between x and y
>> >  
>> >  Kind regards, Soren

I have got intrigued by this problem! In his original query,
Søren Faurby did not specify the original vector 'x' for
which he wanted to generate uniformly distributed 'y' with
a given correlation 'r':

  "I wish to generate a vector of uniformly distributed
   data with a defined correlation to another vector"

In particular he did not state whether it should have any
specified distriobution -- he only specified that 'y' should
be uniformly distributed.

So I'm addressing the general problem in which 'x' may have
an arbitrary distribution (or be an arbitrary given vector),
and y is to have a uniform distribution and have a given
correlation 'r' with 'x'. In the following, I assume r>0.

I have an idea for an algorithm, though not sure how best to
write a general function for it. It is based on the idea that
you first sort 'x'; then generate 'y' as runif(length(x))
and sort 'y'. This generates a vector 'y' with maximal
correlation with 'x'. Now, you start at the two extreme ends
of 'y', and swap those two elements of 'y'. If the result
still exceeds the desired correlation 'x', you move in
and swap again, and so on. If you drop below the desired
correlation, then you back-track, and try swapping a pair
closer to the middle. And so on. Once you get really close,
but still just above 'r', you switch to working out from
the middle to fine-tune the result.

The basic idea is to permute the elements of 'y' to get
as close as possible to 'r'. But one needs to avoid an
exhaustive search through all permutations!

The following is an example, with length(x)=100 and 'x'
Normally distributed, and the desired correlation r=0.5,
was done by hand in about 20 minutes. Only the highlights
are presented.

set.seed(54321)
X0 <- rnorm(100) ; Y0 <- runif(100)
r <- 0.5

X1 <- sort(X0) ; Y1 <- sort(Y0)

# First try (goes a bit too far):
X <- X1 ; Y <- Y1
N <- length(Y)
T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
T <- Y[3] ; Y[3] <- Y[N-2] ; Y[N-2] <- T # cor(X,Y) = 0.4803957

# So back up and skip the (3<->98) swap)
X <- X1 ; Y <- Y1
T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.5031032
T <- Y[5] ; Y[5] <- Y[N-4] ; Y[N-4] <- T # cor(X,Y) =  0.400471

# Following a few tries at (20,81), (30,71), (40,61):
X <- X1 ; Y <- Y1
T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.5031032
T <- Y[45]; Y[45]<- Y[N-44]; Y[N-44]<- T # cor(X,Y) = 0.5017024

# Now grope more finely:
X <- X1 ; Y <- Y1
T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.5031032
T <- Y[43]; Y[43]<- Y[N-42]; Y[N-42]<- T # cor(X,Y) = 0.5001594

# And again:
X <- X1 ; Y <- Y1
T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.5031032
T<- Y[43] ; Y[43]<- Y[N-42]; Y[N-42]<- T # cor(X,Y) = 0.5001594
T<- Y[49] ; Y[49]<- Y[N-48]; Y[N-48]<- T # cor(X,Y) = 0.500079
T<- Y[50] ; Y[50]<- Y[N-49]; Y[N-49]<- T # cor(X,Y) = 0.5000634

### And this is probably as close as one can get. Pretty close!

Now one can plot(X,Y). Maybe the resulting cross-shape (more
or less inevitable from such a procedure) is undesirable.
But at least the desiderata stated in the query have been met.

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at wlandres.net>
Fax-to-email: +44 (0)870 094 0861
Date: 21-Feb-11                                       Time: 21:38:06
------------------------------ XFMail ------------------------------



More information about the R-help mailing list