[R] Help with simulation of unbalanced clustered data

Wed Dec 16 15:56:12 CET 2020

Thank you for the reminder, Jeff. I am new to R-help, so please
bear with my ignorance. This is not homework, and here is a
reproducible example. The number of observations per cluster does not
follow the condition specified in my original post; I just used this
to convey my idea.

> y <- rnorm(20)
> x <- rnorm(20)
> z <- rep(1:5, 4)
> w <- rep(1:4, each=5)
> dd <- data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset
> dd
   id cluster           x           y
1   1       1  0.30003855  0.65325768
2   2       1 -1.00563626 -0.12270866
3   3       1  0.01925927 -0.41367651
4   4       1 -1.07742065 -2.64314895
5   5       1  0.71270333 -0.09294102
6   1       2  1.08477509  0.43028470
7   2       2 -2.22498770  0.53539884
8   3       2  1.23569346 -0.55527835
9   4       2 -1.24104450  1.77950291
10  5       2  0.45476927  0.28642442
11  1       3  0.65990264  0.12631586
12  2       3 -0.19988983  1.27226678
13  3       3 -0.64511396 -0.71846622
14  4       3  0.16532102 -0.45033862
15  5       3  0.43881870  2.39745248
16  1       4  0.88330282  0.01112919
17  2       4 -2.05233698  1.63356842
18  3       4 -1.63637927 -1.43850664
19  4       4  1.43040234 -0.19051680
20  5       4  1.04662885  0.37842390

After randomly adding and deleting some data, the unbalanced data look
like this:

            id cluster      x      y
       1     1       1  0.895 -0.659
       2     2       1 -0.160 -0.366
       3     1       2 -0.528 -0.294
       4     2       2 -0.919  0.362
       5     3       2 -0.901 -0.467
       6     1       3  0.275  0.134
       7     2       3  0.423  0.534
       8     3       3  0.929 -0.953
       9     4       3  1.67   0.668
      10     5       3  0.286  0.0872
      11     1       4 -0.373 -0.109
      12     2       4  0.289  0.299
      13     3       4 -1.43  -0.677
      14     4       4 -0.884  1.70
      15     5       4  1.12   0.386
      16     1       5 -0.723  0.247
      17     2       5  0.463 -2.59
      18     3       5  0.234  0.893
      19     4       5 -0.313 -1.96
      20     5       5  0.848 -0.0613
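
A minimal sketch of how an unbalanced layout like the one above could be built directly from a vector of cluster sizes. The sizes (2, 3, 5, 5, 5) are read off the table; the names `sizes` and `unbal_small` are hypothetical, and the x and y values are fresh random draws, not the ones printed above.

```r
# Assumed cluster sizes, taken from the table above
sizes <- c(2, 3, 5, 5, 5)

unbal_small <- data.frame(
  id      = sequence(sizes),                       # 1..n within each cluster
  cluster = rep(seq_along(sizes), times = sizes),  # cluster label repeated n times
  x       = rnorm(sum(sizes)),
  y       = rnorm(sum(sizes))
)

table(unbal_small$cluster)  # number of rows per cluster
```

Here rep(..., times = sizes) repeats each cluster label a different number of times, which is what makes the design unbalanced while the total stays at 20 rows.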

Here is what I tried:

dd[-sample(which(dd$cluster==sample(unique(dd$cluster),round(0.2*length(unique(dd$cluster))))),
           round(0.5*length(which(dd$cluster==sample(unique(dd$cluster),round(0.2*length(unique(dd$cluster)))))))),]

I know it is very inefficient. It also only deletes rows at random and
does nothing to add rows, so the total number of observations is not
preserved. Thank you for your help!
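
One possible sketch of the oversample-then-delete idea from the original question: simulate 20 clusters of 33 rows, then drop 60 rows with cluster-specific weights so that 600 remain. The weights, the choice of five protected clusters, the seed, and the names `sim`, `wt`, `drop`, and `unbal` are all assumptions made for illustration. Note also that the attempt above calls sample() on the clusters twice, so the two calls can pick different clusters; drawing the row indices once avoids that.

```r
set.seed(2020)  # arbitrary seed so the sketch is reproducible

n_clusters <- 20
n_target   <- 30                                  # desired average cluster size
n_start    <- round(1.1 * n_target)               # 10% more: 33 per cluster
n_remove   <- n_clusters * (n_start - n_target)   # 60 rows to drop

# Balanced, oversampled data: 20 clusters x 33 rows = 660 rows
sim <- data.frame(
  id      = rep(seq_len(n_start), times = n_clusters),
  cluster = rep(seq_len(n_clusters), each = n_start),
  x       = rnorm(n_clusters * n_start),
  y       = rnorm(n_clusters * n_start)
)

# Assumption: one removal weight per cluster; a weight of 0 protects a cluster
wt <- runif(n_clusters)
wt[sample(n_clusters, 5)] <- 0   # five clusters lose no rows (arbitrary choice)

# Draw 60 distinct row indices, weighting each row by its cluster's weight
drop  <- sample(nrow(sim), size = n_remove, prob = wt[sim$cluster])
unbal <- sim[-drop, ]

nrow(unbal)            # 600 rows remain
table(unbal$cluster)   # cluster sizes now differ
```

Because sample() draws without replacement here, the chance that a row is removed is only roughly proportional to its weight, which should be enough if the goal is simply unequal cluster sizes.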

Best,

Liu

On Wed, Dec 16, 2020 at 8:50 AM Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
wrote:

> This is R-help, not R-do-my-work-for-me. It is also not a homework help
> line. The Posting Guide is required reading. Assuming this is not homework,
> since each step in your problem definition can be mapped to a fairly basic
> operation in R (the sample function and indexing being key tools), you
> should be showing your work with a reproducible example that illustrates
> where you are stuck or why the result you are getting does not exhibit the
> desired properties.
>
> On December 15, 2020 6:48:12 PM PST, Chao Liu <psychaoliu using gmail.com>
> wrote:
> >Dear R experts,
> >
> >I want to simulate some unbalanced clustered data. The number of clusters
> >is 20 and the average number of observations is 30. However, I would like
> >to create an unbalanced clustered data per cluster where there are 10% more
> >observations than specified (i.e., 33 rather than 30). I then want to
> >randomly exclude an appropriate number of observations (i.e., 60) to arrive
> >at the specified average number of observations per cluster (i.e., 30). The
> >probability of excluding an observation within each cluster was not uniform
> >(i.e., some clusters had no cases removed and others had more excluded).
> >Therefore in the end I still have 600 observations in total. How to realize
> >that in R? Thank you for your help!
> >
> >Best,
> >
> >Liu
> >
> >______________________________________________
> >R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >https://stat.ethz.ch/mailman/listinfo/r-help
> >PLEASE do read the posting guide
> >http://www.R-project.org/posting-guide.html
> >and provide commented, minimal, self-contained, reproducible code.
>
> --
> Sent from my phone. Please excuse my brevity.
>
