[R] Splitting a data column randomly into 3 groups

Sun Sep 5 00:34:41 CEST 2021

I have a more general problem for you.

Given n items and 2 <= g << n, how do you divide the n items into g
groups that are "as equal as possible"?

First, operationally define "as equal as possible."
Second, define the algorithm to carry out the definition. Hint: Note
that sum{m[i]} over i = 1, ..., g must equal n, where m[i] is the
number of items in the ith group.
Third, write R code for the algorithm. Exercise for the reader.
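
One possible reading of the exercise, offered only as a sketch: take "as
equal as possible" to mean that no two group sizes differ by more than one
item. Under that assumption every group gets n %/% g items and n %% g of
the groups get one extra. The function name group_sizes is made up for
this illustration.

```r
# Hypothetical helper: sizes of g groups that differ by at most one item.
group_sizes <- function(n, g) {
  stopifnot(g >= 1, n >= g)
  base  <- n %/% g   # every group gets at least this many items
  extra <- n %% g    # this many groups get one additional item
  # the last 'extra' groups receive the additional item
  rep(base, g) + rep(c(0L, 1L), times = c(g - extra, extra))
}

group_sizes(204, 3)  # 68 68 68
group_sizes(112, 3)  # 37 37 38
group_sizes(284, 3)  # 94 95 95
```

By construction the sizes sum to n, which is the constraint in the hint.
Other definitions of "equal" (for example, ones involving weights) would
need a different algorithm.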

I may be wrong, but I think numerical analysts might also have a
little fun here.

Randomization, of course, is trivial.
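
To illustrate why it is trivial, here is a minimal sketch, assuming the
group sizes may differ by at most one and that it does not matter which
group ends up with the extra item. The name random_groups is hypothetical;
rep_len(), sample(), and split() are base R.

```r
# Hypothetical helper: randomly partition x into g groups of near-equal size.
random_groups <- function(x, g) {
  labels <- rep_len(seq_len(g), length(x))  # 1, 2, ..., g, 1, 2, ...
  split(x, sample(labels))                  # shuffle the labels, then split
}

set.seed(1)                       # only to make the illustration repeatable
ids <- sample(266, 112)           # 112 IDs drawn from a population of 266
grp <- random_groups(ids, 3)
lengths(grp)                      # group sizes 38, 37, 37
```

Here the first group receives the extra item (38, 37, 37 rather than
37, 37, 38); since the assignment is random, relabeling the groups changes
nothing of substance.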

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Sat, Sep 4, 2021 at 2:13 PM AbouEl-Makarim Aboueissa
<abouelmakarim1962 using gmail.com> wrote:
>
> Dear Thomas:
>
>
> Thank you very much for your input in this matter.
>
>
> The core part of this R code (please see below) was written by *Richard
> O'Keefe*. I had three examples with different sample sizes.
>
>
>
> *First sample of size n1 = 204* divided randomly into three groups of size
> 68 each. *No problems with this one*.
>
>
>
> *The second sample of size n2 = 112* divided randomly into three groups of
> sizes 37, 37, and 38. BUT this R code generated three groups of equal sizes
> (37, 37, and 37). *How to fix the code to make sure that the output will be
> three groups of sizes 37, 37, and 38*.
>
>
>
> *The third sample of size n3 = 284* divided randomly into three groups of
> sizes 94, 95, and 95. BUT this R code generated three groups of equal sizes
> (94, 94, and 94). *Again*, *how to fix the code to make sure that the
> output will be three groups of sizes 94, 95, and 95*.
>
>
> With many thanks
>
> abou
>
>
> ###########  ------------------------   #############
>
>
> N1 <- 485
> population1.IDs <- seq(1, N1, by = 1)
> #### population1.IDs
>
> n1 <- 204   ##### in this case each of the three groups has size 68
> sample1.IDs <- sample(population1.IDs,n1)
> #### sample1.IDs
>
> ####  n1 <- length(sample1.IDs)
>
>   m1 <- n1 %/% 3
>   s1 <- sample(1:n1, n1)
>   group1.IDs <- sample1.IDs[s1[1:m1]]
>   group2.IDs <- sample1.IDs[s1[(m1+1):(2*m1)]]
>   group3.IDs <- sample1.IDs[s1[(m1*2+1):(3*m1)]]
>
> groups.IDs <-cbind(group1.IDs,group2.IDs,group3.IDs)
>
> groups.IDs
>
>
> ####### --------------------------
>
>
> N2 <- 266
> population2.IDs <- seq(1, N2, by = 1)
> #### population2.IDs
>
> n2 <- 112   ##### in this case the three group sizes are 37, 37, and 38,
>             ##### BUT this code generates three equal groups (37, 37, 37)
> sample2.IDs <- sample(population2.IDs,n2)
> #### sample2.IDs
>
> ####  n2 <- length(sample2.IDs)
>
>   m2 <- n2 %/% 3
>   s2 <- sample(1:n2, n2)
>   group1.IDs <- sample2.IDs[s2[1:m2]]
>   group2.IDs <- sample2.IDs[s2[(m2+1):(2*m2)]]
>   group3.IDs <- sample2.IDs[s2[(m2*2+1):(3*m2)]]
>
> groups.IDs <-cbind(group1.IDs,group2.IDs,group3.IDs)
>
> groups.IDs
>
>
> ####### --------------------------
>
>
>
> N3 <- 674
> population3.IDs <- seq(1, N3, by = 1)
> #### population3.IDs
>
> n3 <- 284   ##### in this case the three group sizes are 94, 95, and 95,
>             ##### BUT this code generates three equal groups (94, 94, 94)
> sample3.IDs <- sample(population3.IDs,n3)
> #### sample3.IDs
>
> ####  n3 <- length(sample3.IDs)
>
>   m3 <- n3 %/% 3
>   s3 <- sample(1:n3, n3)
>   group1.IDs <- sample3.IDs[s3[1:m3]]
>   group2.IDs <- sample3.IDs[s3[(m3+1):(2*m3)]]
>   group3.IDs <- sample3.IDs[s3[(m3*2+1):(3*m3)]]
>
> groups.IDs <-cbind(group1.IDs,group2.IDs,group3.IDs)
>
> groups.IDs
>
> ______________________
>
>
> *AbouEl-Makarim Aboueissa, PhD*
>
> *Professor, Statistics and Data Science*
> *Graduate Coordinator*
>
> *Department of Mathematics and Statistics*
> *University of Southern Maine*
>
>
>
> On Sat, Sep 4, 2021 at 11:54 AM Thomas Subia <tgs77m using yahoo.com> wrote:
>
> > Abou,
> >
> >
> >
> > I’ve been following your question on how to split a data column randomly
> > into 3 groups using R.
> >
> >
> >
> > My method may not be suitable for a large set of data, but it is surely
> > worth considering since it makes sense intuitively.
> >
> >
> >
> > mydata <- LETTERS[1:11]
> >
> > > mydata
> >
> > [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K"
> >
> >
> >
> > # Let’s choose a random sample of size 4 from mydata
> >
> > random_grp1 <- sample(mydata, 4)
> >
> > > random_grp1
> >
> > [1] "J" "H" "D" "A"
> >
> >
> >
> > Now my next random selection of data is defined by
> >
> > data_wo_random <- setdiff(mydata,random_grp1)
> >
> > # this makes sense because I need to choose random data from a set which
> > # is defined by the difference of the sets mydata and random_grp1
> >
> >
> >
> > > data_wo_random
> >
> > [1] "B" "C" "E" "F" "G" "I" "K"
> >
> >
> >
> > This is great! So now I can randomly select data of any size from this set.
> >
> > Repeating this process can easily generate subgroups of your original
> > dataset of any size you want.
> >
> >
> >
> > Surely this method could be improved so that this could be done
> > automatically.
> >
> > Nevertheless, this is an intuitive method which I believe is easier to
> > understand than some of the other methods posted.
> >
> >
> >
> > Hope this helps!
> >
> >
> >
> > Thomas Subia
> >
> > Statistician