[R] . Re: Splitting a data column randomly into 3 groups

Avi Gross avigross at verizon.net
Sat Sep 4 19:58:18 CEST 2021


Thomas,

Many approaches to partitioning along the lines you mention (and others)
have been tried over the years. R already has many built in, or available
in packages, including some that are quite optimized. So anyone doing
serious work can often avoid doing this the hard way and build on earlier
work.

Now, obviously, some people who are learning may take on such challenges or
have them assigned as homework. And if you want to tweak efficiency, you
may be able to do things like verify that the conditions sample() would
check are already met and then call sample.int() directly, and so on.
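As a small, hedged illustration: when you know the population is simply the
integers 1..n, sample.int() skips the argument interpretation that sample()
performs on its first argument.

# draw four distinct indices from 1..11; equivalent to sample(11, 4),
# but without sample()'s dispatch on the type of its first argument
idx <- sample.int(11L, 4L)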

But fundamentally, a large subset of these kinds of sampling tasks can be
done just by playing with indices. It does not matter whether your data is
a list, some other kind of vector, a data.frame, or a matrix: anything you
can subset with integers will do.

So an algorithm could determine how many subsets of the indices you want,
calculate how many items go in each bucket, and proceed fairly simply. One
approach is to scramble the indices in some form (a vector of them, or
something more like an unordered set). You then take the first however-many
of them for the first partition, the next batch for the second, and so on.
Finally, you apply each group of indices to subset the original data into
multiple smaller collections.
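Here is a minimal sketch of that scrambled-index idea (the variable names
are my own, not from any package):

mydata <- LETTERS[1:11]
sizes  <- c(4, 4, 3)                 # desired partition sizes
stopifnot(sum(sizes) == length(mydata))
idx    <- sample(seq_along(mydata))  # one scramble of all the indices
groups <- split(idx, rep(seq_along(sizes), sizes))
parts  <- lapply(groups, function(i) mydata[i])

One shuffle, one split(), and no repeated copying of the remaining data.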

Obviously, you can instead work in stages if you prefer; your algorithm
seems to be along those lines. Start with your full data and pull out what
you want for the first partition. Then, with what is left, repeat for the
second partition, and so on, until what remains becomes the final
partition. Arguably this is more work in some ways, especially for larger
amounts of data.

I do note that many statistical processes, such as bootstrapping, may allow
various kinds of overlap, in which the same items can be used repeatedly,
sometimes even within a single random sample. In those cases the algorithm
has to sample with replacement, and that is a somewhat different
discussion.
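For what it is worth, in R that different discussion is largely a
one-argument change:

# a bootstrap-style resample: the same element may appear several times
boot_sample <- sample(LETTERS[1:11], 11, replace = TRUE)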

What I am finding here is that many problems posed are not explained in the
way they turn out to be needed in the end. So answering a question before
we know what it is can be a tad premature and waste time all around. But in
reality, some people new to R, or to computing in general, may be stuck
precisely on understanding the question they are trying to solve, or may
not realize that their use of language (especially when English is not one
of their stronger languages) can be a problem, as their listeners/readers
assume they mean something else.

Your suggestion of how to do some things is reasonable, and you raise a
question about larger amounts of data. Many languages, Python for example,
often have higher-level facilities arguably better designed for some
problems. R was built with vectorization in mind, and most things are
ultimately vectors in a sense. For some purposes it would be nice to have
primitives along the lines of sets, bags, hashes/dictionaries, and so on.
People have added some things along these lines, but if you look at the
implementation behind your use of setdiff(), it just calls unique() on two
arguments coerced into vector form!
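For reference, base R defines setdiff() as (approximately; the exact body
may vary a little between versions) nothing more than this:

setdiff <- function(x, y) {
    x <- as.vector(x)
    y <- as.vector(y)
    unique(if (length(x) || length(y)) x[match(x, y, 0L) == 0L] else x)
}

No persistent set structure is kept around; each call coerces and scans the
vectors afresh.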

So consider what happens if you instead start in a language with a native
set construct, where sets are implemented efficiently. To make N groupings
that are distinct, you might begin by adding all the indices, or even the
complete entities, into the set. You can then ask for a random element from
the set (perhaps with deletion) until you have your first N items, then ask
for the next group of N', and then N'', until you have what you need and
perhaps the set is empty. An implementation that uses some form of hashing
to store and access set items can make finding things take the same amount
of time regardless of size. There is no endless copying of parts of the
data, and either no need for a setdiff() step or a more efficient way of
doing it.
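R has no native set type, but an environment (which is hash-based) can
stand in for one to sketch that interface. One caveat in this sketch: ls()
scans all of the keys, so the random draw below is not the constant-time
operation a purpose-built set could offer.

make_set <- function(items) {
    e <- new.env(hash = TRUE)
    for (k in as.character(items)) assign(k, TRUE, envir = e)
    e
}
draw_and_delete <- function(e, n) {
    picked <- sample(ls(e), n)    # random elements still in the 'set'
    rm(list = picked, envir = e)  # remove them so later draws are disjoint
    picked
}
s    <- make_set(LETTERS[1:11])
grp1 <- draw_and_delete(s, 4)
grp2 <- draw_and_delete(s, 4)
grp3 <- ls(s)                     # whatever remains is the last group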

And, of course, if your data can contain things like redundant elements or
rows, some kind of bag (multiset) primitive may be useful, and you can do
your scrambling and partitioning without using explicit indices.

R has been used extensively for a long time and most people use what is
already there. Some create new object types or packages, of course.

I have sometimes taken a hybrid approach, and one interesting version is a
hybrid environment. I have, for example, done things like the above by
writing a program using a package that allows an R and a Python interpreter
to work together on data they more or less share and at times interconvert.
If you have statistical routines you trust in R but want some of the
preparation, arrangement, or further analysis done with functionality you
have in Python, you can combine the best of both worlds in a single
program. Heck, the same environment may also be producing a document in a
markup language in which code from both languages is embedded, selectively
emitting results (including graphics) into something like a PDF or another
document form.
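Purely as an illustration (the paragraph above does not name a package),
the reticulate package is one such R/Python bridge:

library(reticulate)
py$items <- LETTERS[1:11]       # hand the R vector to Python as a list
py_run_string("import random; random.shuffle(items)")
shuffled <- py$items            # bring the shuffled copy back to R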

I know the current discussion was, sort of, about dividing a set of data
into three groups, and efficiency may not be a major consideration,
especially as what is done with the groups may be the dominant user of
resources. But many newer techniques, such as random forests, may involve
taking the same data, repeatedly partitioning it, running an analysis, and
repeating many thousands of times before combining the results in some way.
Some break the data down over multiple stages until only a few items
remain. Some allow strange things like small groups that could in theory
consist of the same original item repeated several times; even if that
makes no sense in the real world, it may improve the results. So the
partitioning part may well be worth doing well.
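A hedged sketch of why that matters: once splits are repeated thousands of
times, the partition step itself sits on the hot path.

mydata <- LETTERS[1:11]
# a thousand independent random groups of four, kept as index vectors
splits <- replicate(1000, sample(seq_along(mydata), 4), simplify = FALSE)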

-----Original Message-----
From: Thomas Subia <tgs77m using yahoo.com> 
Sent: Saturday, September 4, 2021 12:26 PM
To: r-help using r-project.org
Cc: avigross using verizon.net; abouelmakarim1962 using gmail.com
Subject: . Re: Splitting a data column randomly into 3 groups

I was wondering if this is a good alternative method to split a data column
into distinct groups.
Let's say I want my first group to have 4 elements selected randomly

mydata <- LETTERS[1:11]
random_grp <- sample(mydata, 4, replace = FALSE)

Now random_grp is:
> random_grp
[1] "H" "E" "A" "D"
# How's that for a random selection!

Now my choices for another group of random data become:
data_wo_random <- setdiff(mydata, random_grp)

> data_wo_random
[1] "B" "C" "F" "G" "I" "J" "K"

Now from this reduced dataset, I can generate another random selection of
any size I choose.

One problem is that this becomes cumbersome when one's original dataset is
large or when one wants to split the original dataset into many subgroups
of different sizes.

Nevertheless, it's an intuitive method which is relatively easy to
understand.

Hope this helps!

Thomas Subia
Statistician


