[R] Help with simulation of unbalanced clustered data

Wed Dec 16 18:44:06 CET 2020

Sigh. You still haven't read the Posting Guide? HTML email causes problems on this mailing list, so please send email using your mail client's plain-text option.

You assert that

>The probability of excluding an observation within each cluster was not uniform

but a different number excluded from each cluster can arise either from different exclusion probabilities or from the same probability combined with random chance.

>(i.e., some clusters had no cases removed and others had more excluded)

so this could occur in various ways.
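
To illustrate the equal-probability case, here is a minimal sketch in which every row has the same chance of being removed, yet the clusters end up with different sizes. The sizes (20 clusters of 33, 60 rows removed) come from your original question; the seed and the object names are arbitrary choices for this example.

```r
set.seed(42)                     # arbitrary seed, only for reproducibility
n_clusters <- 20
n_start    <- 33                 # 10% more than the target of 30 per cluster
dd <- data.frame(cluster = rep(seq_len(n_clusters), each = n_start),
                 x = rnorm(n_clusters * n_start),
                 y = rnorm(n_clusters * n_start))

# Remove 60 rows, with every row equally likely to be chosen
drop_rows  <- sample(nrow(dd), 60)
unbalanced <- dd[-drop_rows, ]

# Cluster sizes now differ by chance alone
sizes <- table(unbalanced$cluster)
sizes
nrow(unbalanced)                 # 600
```

The total is fixed at 600 because exactly 60 of the 660 rows are dropped; only the split across clusters is random.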

If you meant for the probability to vary, just how should it vary?
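
As one possibility only: give each cluster its own removal weight and pass the per-row weights to sample() through its prob argument. The weights below (uniform random draws, with five clusters set to zero so that they lose nothing) are purely an assumption for illustration, since you have not said how the probability should vary.

```r
set.seed(123)                    # arbitrary seed, only for reproducibility
n_clusters <- 20
n_start    <- 33
cluster <- rep(seq_len(n_clusters), each = n_start)
dd <- data.frame(cluster = cluster,
                 x = rnorm(length(cluster)),
                 y = rnorm(length(cluster)))

# Hypothetical cluster-level removal weights; five clusters get weight 0
w <- runif(n_clusters)
w[sample(n_clusters, 5)] <- 0

# Each row inherits the weight of its cluster; zero-weight rows are never drawn
drop_rows  <- sample(nrow(dd), 60, prob = w[dd$cluster])
unbalanced <- dd[-drop_rows, ]

# Use a factor so that every cluster is reported, whatever its final size
sizes <- table(factor(unbalanced$cluster, levels = seq_len(n_clusters)))
sizes
sum(sizes)                       # 600
```

Note that when sampling without replacement the weights are applied draw by draw, so the realized removal proportions are only roughly proportional to w.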

Also, changing your requirements mid-stream makes it very difficult to see what you really want to accomplish.

On December 16, 2020 6:56:12 AM PST, Chao Liu <psychaoliu using gmail.com> wrote:
>Thank you for the reminder, Jeff. I am new to R-help and so please
>bear with my ignorance. This is not homework and here is a
>reproducible example. The number of observations per cluster doesn't
>follow the condition specified above, though; I just used this to
>convey my idea.
>
>> y <- rnorm(20)
>> x <- rnorm(20)
>> z <- rep(1:5, 4)
>> w <- rep(1:4, each=5)
>> dd <- data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset
>   id cluster           x           y
>1   1       1  0.30003855  0.65325768
>2   2       1 -1.00563626 -0.12270866
>3   3       1  0.01925927 -0.41367651
>4   4       1 -1.07742065 -2.64314895
>5   5       1  0.71270333 -0.09294102
>6   1       2  1.08477509  0.43028470
>7   2       2 -2.22498770  0.53539884
>8   3       2  1.23569346 -0.55527835
>9   4       2 -1.24104450  1.77950291
>10  5       2  0.45476927  0.28642442
>11  1       3  0.65990264  0.12631586
>12  2       3 -0.19988983  1.27226678
>13  3       3 -0.64511396 -0.71846622
>14  4       3  0.16532102 -0.45033862
>15  5       3  0.43881870  2.39745248
>16  1       4  0.88330282  0.01112919
>17  2       4 -2.05233698  1.63356842
>18  3       4 -1.63637927 -1.43850664
>19  4       4  1.43040234 -0.19051680
>20  5       4  1.04662885  0.37842390
>
>After randomly adding and deleting some data, the unbalanced data
>become like this:
>
>           id cluster      x      y
>       1     1       1  0.895 -0.659
>       2     2       1 -0.160 -0.366
>       3     1       2 -0.528 -0.294
>       4     2       2 -0.919  0.362
>       5     3       2 -0.901 -0.467
>       6     1       3  0.275  0.134
>       7     2       3  0.423  0.534
>       8     3       3  0.929 -0.953
>       9     4       3  1.67   0.668
>      10     5       3  0.286  0.0872
>      11     1       4 -0.373 -0.109
>      12     2       4  0.289  0.299
>      13     3       4 -1.43  -0.677
>      14     4       4 -0.884  1.70
>      15     5       4  1.12   0.386
>      16     1       5 -0.723  0.247
>      17     2       5  0.463 -2.59
>      18     3       5  0.234  0.893
>      19     4       5 -0.313 -1.96
>      20     5       5  0.848 -0.0613
>
>Here is what I tried:
>dd[-sample(which(dd$cluster==sample(unique(dd$cluster),round(0.2*length(unique(dd$cluster))))),
>round(0.5*length(which(dd$cluster==sample(unique(dd$cluster),round(0.2*length(unique(dd$cluster)))))))),].
>I know it is very inefficient. Also, it just randomly deleted rows and
>had no effect on adding rows to match the total number of
>observations. Thank you for your help!
>
>
>Best,
>
>Liu
>
>
>
>On Wed, Dec 16, 2020 at 8:50 AM Jeff Newmiller <jdnewmil using dcn.davis.ca.us> wrote:
>
>> This is R-help, not R-do-my-work-for-me. It is also not a homework
>> help line. The Posting Guide is required reading. Assuming this is
>> not homework, since each step in your problem definition can be
>> mapped to a fairly basic operation in R (the sample function and
>> indexing being key tools), you should be showing your work with a
>> reproducible example that illustrates where you are stuck or why the
>> result you are getting does not exhibit the desired properties.
>>
>> On December 15, 2020 6:48:12 PM PST, Chao Liu <psychaoliu using gmail.com> wrote:
>> >Dear R experts,
>> >
>> >I want to simulate some unbalanced clustered data. The number of
>> >clusters is 20 and the average number of observations is 30.
>> >However, I would like to create unbalanced clustered data where
>> >each cluster initially has 10% more observations than specified
>> >(i.e., 33 rather than 30). I then want to randomly exclude an
>> >appropriate number of observations (i.e., 60) to arrive at the
>> >specified average number of observations per cluster (i.e., 30).
>> >The probability of excluding an observation within each cluster was
>> >not uniform (i.e., some clusters had no cases removed and others had
>> >more excluded). Therefore in the end I still have 600 observations
>> >in total. How can I do that in R? Thank you for your help!
>> >
>> >Best,
>> >
>> >Liu
>> >
>> >       [[alternative HTML version deleted]]
>> >
>>

-- 
Sent from my phone. Please excuse my brevity.