[R] Generating uniformly distributed correlated data.

Mon Feb 21 17:58:34 CET 2011

On Feb 21, 2011, at 9:53 AM, Erich Neuwirth wrote:

> We want to generate a distribution on the unit square with the  
> following
> properties
> * It is concentrated on a "reasonable" subset of the square,
>  and the restricted distribution is uniform on this subset.
> * Both marginal distributions are uniform on the unit interval.
> * All horizontal and all vertical cross sections are sets of lines
>  segments with the same total length
>
> If we find a geometric figure with these properties, we have solved  
> the
> problem.
>
> So we define the distribution to be uniform on the following area:
> (it is distorted but should give the idea)
>
> x***/-----------------/***x
> |**/-----------------/****|
> |*/-----------------/*****|
> |/-----------------/******|
> |-----------------/******/|
> |----------------/******/-|
> |---------------/******/--|
> |--------------/******/---|
> |-------------/******/----|
> |------------/******/-----|
> |-----------/******/------|
> |----------/******/-------|
> |---------/******/--------|
> |--------/******/---------|
> |-------/******/----------|
> |------/******/-----------|
> |-----/******/------------|
> |----/******/-------------|
> |---/******/--------------|
> |--/******/---------------|
> |-/******/----------------|
> |/******/-----------------|
> |******/-----------------/|
> |*****/-----------------/*|
> |****/-----------------/**|
> x***/-----------------/***x
>
> There is the same number of stars in each horizontal row and each
> vertical column.

I love it! I plotted the data that your method generated and got a  
plot with points in the regions you displayed. After pondering its  
curious pathology (three disjoint regions), I thought I could cure  
that pathology by flipping quadrants. I proceeded to do so but  
discovered that the pathology wasn't really cured but only  
concentrated at the ends and middle. Using your ASCII art, my version   
wuld look like this where the \\\ regions are "double dense".

y[x>0.5 & y <0.5] = 0.5 +abs(0.5-y[x>0.5 & y <0.5])
y[x<0.5 & y >0.5] = 0.5 -abs(0.5-y[x<0.5 & y >0.5])
plot(x,y)

x---------------------\\\\x
|--------------------/*\\\|
|-------------------/***\\|
|------------------/*****\|
|-----------------/******/|
|----------------/******/-|
|---------------/******/--|
|--------------/******/---|
|-------------/******/----|
|-------------|\\***/-----|
|-------------|\\\*/------|
|-------------|\\\\-------|
|-------------*-----------|
|--------/\\\\------------|
|-------/**\\\------------|
|------/****\\------------|
|-----/******\------------|
|----/******/-------------|
|---/******/--------------|
|--/******/---------------|
|-/******/----------------|
|/******/-----------------|
|\*****/------------------|
|\\***/-------------------|
|\\\\/--------------------|
\\\\\---------------------x

So it not only created a still slightly less pathological counterpart,  
but the correlation jumped from 0.5 to 0.95. It looks to be a  
promising basis for homework problems in probability courses, an  
experience I have never has the ?pleasure? to experience except during  
self-study to repair my (many) mathematical deficiencies.

 > cor(x,y)
[1] 0.9449256

-- 
David.

>
>
> So we define
> g(x1,x2)= 1 abs(x1-x2) <= a or
>            abs(x1-x2+1) <= a or
>            abs(x1-x2-1) <= a
>          0 elsewhere
>
> The total area of the shape is 2*a.
> The admissible range for a is <0,1/2>
> therefore
> f(x1,x2)=g(x1,x2)/(2*a)
> is a density functions.
> This is where simple algebra comes in.
> This distribution has
> expected value 1/2 and variance 1/12 for both margins
> (uniform distribution), and it has
> covariance = (1-3*a+2*a2)/12
> and correlation = 1 - 3*a + 2*a2
>
> The inverse function of 1 - 3*2 + 2*a2 is
> (3-sqrt(1+8*r))/4
>
> Therefore we can compute that our distribution with
> a=(3-sqrt(1+8*r))/4
> will produce a given r.
>
>
> Ho do we create random numbers from this distribution?
> By using conditional densities.
> x1 is sampled from the uniform distribution, and for a give x1
> we produce x2 by a uniform distribution on the along the vertical  
> cross
> cut of the geometrical shape (which is either 1 or 2 intervals).
> And which is most easily implemented by using the modulo operator %%.
>
> This mechanism is NOT a convolution. Applying module after the  
> addition
> makes it a nonconvolution. Adding independent random variables
> without doing anything further is a convolution, by applying a  
> trimming
> operation, the convolution property gets lost.

-- 
David Winsemius, MD
West Hartford, CT