[R] Binning question (binning rows of a data.frame according to a variable)

Sun Mar 19 19:43:08 CET 2006

Gabor Grothendieck wrote:
> On 3/18/06, Dan Bolser <dmb at mrc-dunn.cam.ac.uk> wrote:
> 
>>Gabor Grothendieck wrote:
>>
>>>If you are just looking for something simple that may be good enough
>>>then assign the largest one to group 1, the second largest to group 2,
>>>..., the 8th largest to group 8 and then start over again with group 1
>>>and so on.
>>>
>>># test data
>>>set.seed(1)
>>>x <- sample(100, 100, rep = TRUE)
>>>
>>>xs <- sort(x)
>>>g <- gl(8, 1, length(xs)) # 8 groups
>>>
>>># so that g contains the groups that correspond to xs.
>>>
>>>tapply(xs, g, sum)   # 659 671 687 701 612 622 629 646
>>>
>>
>>
>>That is a fairly neat way of getting groups with a good 'approximate
>>same size', however, in general I would like to be able to order my data
>>in any way, and still cut it into equal 'size' groups (like quantiles
>>for rows, but for row variable totals instead).
> 
> 
> Do you mean you want g to be in the original order of x?  

No. What I mean is that I want to order x by any particular variable in 
my data.frame, then group over x such that each group has roughly the 
same sum.

I get the feeling I have missed a very simple trick.

> order(x)
> is the permutation which sorts x and order(order(x)) is its inverse
> permutation so apply that to the gl expression:
> 
> x <- c(10, 4, 15, 2, 20, 13)
> g <- gl(2, 1, length(x))[order(order(x))]
> 
> # check it
> identical(tapply(sort(x), gl(2, 1, length(x)), sum), tapply(x, g, sum))
> 
> 
>>Seems it should be possible without an explicit loop (and some more
>>'refinement' of the final group sizes), but I can't work it out.
>>
>>
>>
>>
>>
>>>On 3/17/06, Dan Bolser <dmb at mrc-dunn.cam.ac.uk> wrote:
>>>
>>>
>>>>Dan Bolser wrote:
>>>>
>>>>
>>>>>Hi,
>>>>>
>>>>>I have tuples of data in rows of a data.frame, each column is a variable
>>>>>for the 'items' (one per row).
>>>>>
>>>>>One of the variables is the 'size' of the item (row).
>>>>>
>>>>>I would like to cut my data.frame into groups such that each group has
>>>>>the same *total size*. So, assuming that we order by size, some groups
>>>>>should have several small items while other groups have a few large
>>>>>items. All the groups should have approximately the same total size.
>>>>>
>>>>>I have tried various combinations of cut, quantile, and ecdf, and I just
>>>>>can't work out how to do this!
>>>>>
>>>>>Any help is greatly appreciated!
>>>>>
>>>>>All the best,
>>>>>Dan.
>>>>>
>>>>
>>>>Perhaps there is a cleaver way, but I just wrote this in despiration...
>>>>
>>>>
>>>>my.groups <- 8
>>>>
>>>>my.total <-
>>>> sum(my.res.1$TOT)   ## The 'size' variable in my data.frame
>>>>
>>>>my.approx.size <-
>>>> my.total/
>>>> my.groups
>>>>
>>>>my.j <- 1
>>>>my.roll <- 0
>>>>my.factor <- numeric()
>>>>
>>>>for(i in sort(my.res.1$TOT)){
>>>>
>>>> my.roll <-
>>>>   my.roll + i
>>>>
>>>> if (my.roll > my.approx.size * my.j)
>>>>   my.j <- my.j + 1
>>>>
>>>> my.factor <-
>>>>   append(my.factor,my.j)
>>>>}
>>>>
>>>>my.factor <-
>>>> as.factor(my.factor)
>>>>
>>>>
>>>>
>>>>Then...
>>>>
>>>>
>>>>>tapply(my.factor,my.factor,length)
>>>>
>>>> 1   2   3   4   5   6   7   8
>>>>152  62  45  34  25  21  14   8
>>>>
>>>>
>>>>And...
>>>>
>>>>
>>>>>tapply(sort(my.res.1$TOT),my.factor,sum)
>>>>
>>>>  1    2    3    4    5    6    7    8
>>>>2880 2848 2912 2893 2832 2906 2776 3029
>>>>
>>>>
>>>>
>>>>Which isn't bad.
>>>>