[R] Random data with correlation

Fri Mar 8 07:16:10 CET 2002

On Fri, 8 Mar 2002, Edwin Velds wrote:

> Hello all.
>
> First of all, I have only been using are a short time and I'm not an
> expert in statistics either.
>
> I have the following problem. I'm working with measurements of physical
> samples, each measurement has about 4000 variables. I have 33 of those
> samples. From those 400 variables I deduced through non-statiscal means
> that I needed about 200 of them. I read those into a data.frame with 200
> columns of 33 rows. As I expected, quite a few variables have a fairly
> high correlation with one another (as high as 0.96). The individual
> variables are about normally distributed according to the Shapiro-Wilks
> test.
>
> With this data, I need to generate new random samples. However, for them
> to be any good they need to have a similar correlation to the original
> samples. Searching the net I found that this could be accomplished by

Once upon a time people used to look up authoritative sources, things
called `monographs', to find the best available methods.  It's still a good
idea.

> calculating random samples with
>
>  Random-sample = M + R L
>
> where M = vector of means
>       R = a vector of snorm values
>       L = the upper triangle matrix of the Choleski decomposition of the
> covariance matrix.

(Strange notation: L is used for the lower-triangle of the Choleski
decomposition as used by Choleski.)

> I verified this with a simple hand made example and it works. But trying
> this with the real data got me into trouble. chol() tells me that the
> covariance matrix of my test data is not positive definite. Which I
> found odd because I was told that a real covariance matrix should be
> that.

Only non-negative definite.  With 33 samples you will only get
a rank-32 covariance matrix.

> After some testing I found that qr(cov-matrix)$rank=32 and
> qr(cov-matrix, tol=1e-20)$rank=200. Which probably means that the matrix
> is numerically badly conditioned.

Theoretically so too.

> But now I'm at a loss on how to proceed. Changing the tolerance gave me
> an answer there. But in for instance the det() function (using qr), I
> can't set the tolerance to pass on to qr and therefore can't get an
> answer. Could the reason for chol() failing be related to this too?

The determinant is 0.

> Or
> maybe there is another method of getting what I want by using less
> sensitive functions and avoiding the problem altogether?

Use mvrnorm in package MASS.  As you will see from its comments, it
avoids chol for precisely this sort of reason, and it has a tolerance
for producing a rank-deficient sample.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._