[R] Binning question (binning rows of a data.frame according to a variable)

Sat Mar 18 13:56:40 CET 2006

On 3/18/06, Dan Bolser <dmb at mrc-dunn.cam.ac.uk> wrote:
> Gabor Grothendieck wrote:
> > If you are just looking for something simple that may be good enough
> > then assign the largest one to group 1, the second largest to group 2,
> > ..., the 8th largest to group 8 and then start over again with group 1
> > and so on.
> >
> > # test data
> > set.seed(1)
> > x <- sample(100, 100, replace = TRUE)
> >
> > xs <- sort(x)
> > g <- gl(8, 1, length(xs)) # 8 groups
> >
> > # so that g contains the groups that correspond to xs.
> >
> > tapply(xs, g, sum)   # 659 671 687 701 612 622 629 646
> >
>
>
> That is a fairly neat way of getting groups of approximately the same
> size. In general, however, I would like to be able to order my data in
> any way and still cut it into groups of equal 'size' (like quantiles
> for rows, but based on the row variable totals instead).

Do you mean you want g to be in the original order of x?  order(x)
is the permutation that sorts x, and order(order(x)) is its inverse
permutation, so index the gl expression by that:

x <- c(10, 4, 15, 2, 20, 13)
g <- gl(2, 1, length(x))[order(order(x))]

# check that the group sums agree with those computed on the sorted data
identical(tapply(sort(x), gl(2, 1, length(x)), sum), tapply(x, g, sum))
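
To see why order(order(x)) acts as the inverse of order(x), here is a small
sketch using the same x. The names o and inv are only for illustration and
do not appear elsewhere in the thread. When x has no ties, the inverse
permutation is also the same as rank(x).

```r
x <- c(10, 4, 15, 2, 20, 13)

o   <- order(x)   # permutation that sorts x, i.e. x[o] is sorted
inv <- order(o)   # inverse permutation: where each x[i] lands in sort(x)

# indexing the sorted vector by the inverse permutation restores x
stopifnot(identical(sort(x)[inv], x))

# composing the permutation with its inverse gives the identity
stopifnot(identical(o[inv], seq_along(x)))

# with no ties, the inverse permutation equals the ranks
stopifnot(all(inv == rank(x)))
```

This is why anything computed in sorted order, such as the gl() labels, can
be put back into the original order by indexing it with order(order(x)).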

>
> Seems it should be possible without an explicit loop (and some more
> 'refinement' of the final group sizes), but I can't work it out.
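
One possible loop-free version of the running-total idea is sketched below
with cumsum(). It is only a sketch under assumptions: the data frame d, its
simulated TOT column and the names n.groups, target, cs and g are all
hypothetical stand-ins, not the original data. It should give the same
labels as the explicit loop whenever no single item crosses more than one
group boundary.

```r
# hypothetical data standing in for the real data.frame
set.seed(1)
d <- data.frame(id = 1:200, TOT = rexp(200))

n.groups <- 8
target   <- sum(d$TOT) / n.groups   # approximate total size per group

o  <- order(d$TOT)                  # rows from smallest to largest size
cs <- cumsum(d$TOT[o])              # running total along that ordering

# group k gets the rows whose running total lies in ((k-1)*target, k*target];
# pmax/pmin guard against a label of 0 or n.groups + 1 at the extremes
g.sorted <- pmin(pmax(ceiling(cs / target), 1), n.groups)

# order(o) is the inverse permutation, so this restores the row order of d
g <- factor(g.sorted[order(o)], levels = seq_len(n.groups))

tapply(d$TOT, g, sum)   # total size per group
table(g)                # number of rows per group
pieces <- split(d, g)   # the data.frame cut into groups
```

Here o sorts by size, but any other ordering of the rows could be used for o
in the same way, since the labels are mapped back with order(o).
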
>
>
>
>
> >
> > On 3/17/06, Dan Bolser <dmb at mrc-dunn.cam.ac.uk> wrote:
> >
> >>Dan Bolser wrote:
> >>
> >>>Hi,
> >>>
> >>>I have tuples of data in rows of a data.frame, each column is a variable
> >>>for the 'items' (one per row).
> >>>
> >>>One of the variables is the 'size' of the item (row).
> >>>
> >>>I would like to cut my data.frame into groups such that each group has
> >>>the same *total size*. So, assuming that we order by size, some groups
> >>>should have several small items while other groups have a few large
> >>>items. All the groups should have approximately the same total size.
> >>>
> >>>I have tried various combinations of cut, quantile, and ecdf, and I just
> >>>can't work out how to do this!
> >>>
> >>>Any help is greatly appreciated!
> >>>
> >>>All the best,
> >>>Dan.
> >>>
> >>
> >>Perhaps there is a clever way, but I just wrote this in desperation...
> >>
> >>
> >>my.groups <- 8
> >>
> >>## TOT is the 'size' variable in my data.frame
> >>my.total <- sum(my.res.1$TOT)
> >>my.approx.size <- my.total / my.groups
> >>
> >>my.j <- 1               ## current group
> >>my.roll <- 0            ## running total of sizes
> >>my.factor <- numeric()
> >>
> >>for (i in sort(my.res.1$TOT)) {
> >>  my.roll <- my.roll + i
> >>
> >>  ## move to the next group once the running total passes its share
> >>  if (my.roll > my.approx.size * my.j)
> >>    my.j <- my.j + 1
> >>
> >>  my.factor <- append(my.factor, my.j)
> >>}
> >>
> >>my.factor <- as.factor(my.factor)
> >>
> >>
> >>
> >>Then...
> >>
> >> > tapply(my.factor,my.factor,length)
> >>  1   2   3   4   5   6   7   8
> >>152  62  45  34  25  21  14   8
> >>
> >>
> >>And...
> >>
> >> > tapply(sort(my.res.1$TOT),my.factor,sum)
> >>   1    2    3    4    5    6    7    8
> >>2880 2848 2912 2893 2832 2906 2776 3029
> >> >
> >>
> >>
> >>
> >>Which isn't bad.
> >>