[R] simulate correlated binary, categorical and continuous variable

Petr Savicky savicky at cs.cas.cz
Mon Apr 2 18:21:17 CEST 2012


On Sun, Apr 01, 2012 at 06:00:43PM -0700, Burak Aydin wrote:
> Hello Greg,
> Sorry for the confusion.
> Let's say I have a population.  I have 6 variables. They are correlated with
> each other. I can get you Pearson, tetrachoric or polychoric
> correlation coefficients.
> 2 of them are continuous, 2 binary, 2 categorical.
> Let's assume the following conditions:
> Co1 and Co2 are normally distributed continuous random variables. Co1 ~
> N(0,1), Co2 ~ N(100,15)
> Ca1 and Ca2 are categorical variables. Ca1 probabilities
> =c(.02,.18,.28,.22,.30), Ca2 probs =c(.06,.18,.76)
> Bi1 and Bi2 are binaries, Marginal probabilities Bi1 p= 0.4,  Bi2 p=0.5.
> And , again, I have the correlations.
> 
> When I try to simulate this population I fail. If I keep the means and
> probabilities the same, I lose the correct correlations. When I keep the
> correlations, I lose precision on means and frequencies/probabilities.

Hi.

One idea that occurred to me is the following. Formulate a model of
the joint distribution with some parameters, together with a criterion
function that measures how much data generated from the model differ
from the required marginal distributions and the required correlations.
Then run an optimization over the parameters to minimize this difference.
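
A minimal sketch of this idea for a single pair (Co1, Bi1), under
assumptions that are mine and not part of the question: the model is a
latent bivariate normal with one parameter rho, the target correlation
0.3 is a made-up value, and the marginals are fixed by construction so
that only the correlation enters the criterion. The random draws are
generated once and reused, which makes the criterion a deterministic
function of rho.

```r
set.seed(1)
n  <- 20000
z1 <- rnorm(n)   # fixed draws, reused in every evaluation of the criterion
z2 <- rnorm(n)

target_cor <- 0.3   # hypothetical required correlation between Co1 and Bi1

# Model with one parameter rho: Co1 is standard normal, Bi1 is a
# thresholded latent normal variable with P(Bi1 = 1) = 0.4.
simulate_pair <- function(rho) {
  latent <- rho * z1 + sqrt(1 - rho^2) * z2
  data.frame(Co1 = z1, Bi1 = as.integer(latent > qnorm(1 - 0.4)))
}

# Criterion: squared difference between simulated and required correlation.
criterion <- function(rho) {
  d <- simulate_pair(rho)
  (cor(d$Co1, d$Bi1) - target_cor)^2
}

fit <- optimize(criterion, interval = c(-0.99, 0.99))
sim <- simulate_pair(fit$minimum)
c(rho = fit$minimum, cor = cor(sim$Co1, sim$Bi1), p = mean(sim$Bi1))
```

With all 6 variables, rho would become a vector of latent correlations
and optimize() would be replaced by optim(), but the structure of the
criterion stays the same.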

If you have enough data, then the model can be a table of estimated
probabilities for all 5*3*2*2 = 60 combinations of the discrete
variables, together with, for each of these combinations, the
parameters of the conditional distribution of the 2 continuous
variables, which can be a bivariate normal distribution. However, you
probably do not have enough data for this.
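
To show the structure of such a model, here is a sketch of sampling from
it. All numbers except the marginal probabilities from the question are
placeholders: the cell probabilities are simply products of the marginals
(that is, independence), and the conditional means and the common
covariance matrix Sigma are invented. In a real application all of them
would be estimated from data.

```r
set.seed(2)
n <- 10000

# All 60 combinations of the discrete variables.
cells <- expand.grid(Ca1 = 1:5, Ca2 = 1:3, Bi1 = 0:1, Bi2 = 0:1)

p_ca1 <- c(.02, .18, .28, .22, .30)
p_ca2 <- c(.06, .18, .76)
p_bi1 <- c(.6, .4)   # P(Bi1 = 0), P(Bi1 = 1)
p_bi2 <- c(.5, .5)

# Placeholder cell probabilities (product of marginals).
cells$prob <- p_ca1[cells$Ca1] * p_ca2[cells$Ca2] *
              p_bi1[cells$Bi1 + 1] * p_bi2[cells$Bi2 + 1]

# Hypothetical conditional means, centered so that the overall means
# stay at 0 and 100 (E[Bi1] = 0.4, E[Ca2] = 2.7).
cells$mu1 <- 0.5 * (cells$Bi1 - 0.4)
cells$mu2 <- 100 + 5 * (cells$Ca2 - 2.7)

# Hypothetical within-cell covariance of (Co1, Co2), shared by all cells.
Sigma <- matrix(c(1, 3, 3, 200), 2, 2)

# Step 1: draw a cell; step 2: draw (Co1, Co2) from that cell's normal.
idx <- sample(nrow(cells), n, replace = TRUE, prob = cells$prob)
eps <- matrix(rnorm(2 * n), n, 2) %*% chol(Sigma)
sim <- data.frame(cells[idx, c("Ca1", "Ca2", "Bi1", "Bi2")],
                  Co1 = cells$mu1[idx] + eps[, 1],
                  Co2 = cells$mu2[idx] + eps[, 2])
```

Each cell carries 5 parameters of the bivariate normal in the general
case (2 means, 2 variances, 1 covariance), which is why the amount of
data needed grows quickly.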

Another approach starts from the distribution of the continuous
variables; the discrete variables can then be modeled by a logistic
model that uses the continuous variables as input.
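
A sketch of this approach for Bi1 only. The correlation 0.3 between Co1
and Co2 and the slope of the logistic model are assumed values chosen for
illustration. The intercept is not chosen by hand: it is solved for with
uniroot() so that the average predicted probability equals the required
marginal probability 0.4.

```r
set.seed(3)
n <- 10000

# Continuous variables first: Co1 ~ N(0,1), Co2 ~ N(100,15),
# with an assumed correlation of 0.3 between them.
Co1 <- rnorm(n)
Co2 <- 100 + 15 * (0.3 * Co1 + sqrt(1 - 0.3^2) * rnorm(n))

slope <- 1.0   # hypothetical strength of the dependence of Bi1 on Co1

# Calibrate the intercept so that the marginal P(Bi1 = 1) is about 0.4.
intercept <- uniroot(
  function(a) mean(plogis(a + slope * Co1)) - 0.4,
  interval = c(-10, 10)
)$root

# Binary variable drawn from the logistic model given Co1.
Bi1 <- rbinom(n, 1, plogis(intercept + slope * Co1))

c(p = mean(Bi1), cor_Co1_Bi1 = cor(Co1, Bi1))
```

The slope controls the correlation between Co1 and Bi1, so it could be
tuned to a required value in the same way as the intercept. For Ca1 and
Ca2, a multinomial or ordinal logistic model would play the same role.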

Another type of model that may be suitable is a Bayesian network.
For this, you need to choose only a subset of the most important
dependencies, so that the selected dependencies can be represented by
a directed acyclic graph.
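
As an illustration only, here is a tiny network with the assumed graph
Bi1 -> Ca2 and Bi1 -> Co1, sampled in the order of the graph. The
conditional probability table and the shift d are invented, but they are
chosen so that the marginals from the question are preserved:
0.6 * c(.08,.22,.70) + 0.4 * c(.03,.12,.85) = c(.06,.18,.76), and Co1 has
mean 0 and variance 1. Note that Co1 is then a mixture of two normals,
so it is only approximately normal.

```r
set.seed(4)
n <- 10000

# Root node: Bi1 with P(Bi1 = 1) = 0.4.
Bi1 <- rbinom(n, 1, 0.4)

# Hypothetical conditional probability table of Ca2 given Bi1
# (row 1: Bi1 = 0, row 2: Bi1 = 1).
cpt_ca2 <- rbind(c(.08, .22, .70),
                 c(.03, .12, .85))

# Sample Ca2 separately within each value of its parent.
Ca2 <- integer(n)
for (b in 0:1) {
  sel <- Bi1 == b
  Ca2[sel] <- sample(1:3, sum(sel), replace = TRUE, prob = cpt_ca2[b + 1, ])
}

# Co1 given Bi1: normal with a mean shifted by the hypothetical amount d.
# The conditional sd is reduced so that the marginal variance is 1.
d   <- 0.8
mu  <- ifelse(Bi1 == 1, 0.6 * d, -0.4 * d)
Co1 <- rnorm(n, mean = mu, sd = sqrt(1 - 0.4 * 0.6 * d^2))

round(c(prop.table(table(Ca2)), mean_Co1 = mean(Co1), sd_Co1 = sd(Co1)), 3)
```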

Petr Savicky.


