[R] Binning question (binning rows of a data.frame according to a variable)

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Sat Mar 18 11:44:55 CET 2006


Gabor Grothendieck wrote:
> If you are just looking for something simple that may be good enough
> then assign the largest one to group 1, the second largest to group 2,
> ..., the 8th largest to group 8 and then start over again with group 1
> and so on.
> 
> # test data
> set.seed(1)
> x <- sample(100, 100, rep = TRUE)
> 
> xs <- sort(x)
> g <- gl(8, 1, length(xs)) # 8 groups
> 
> # so that g contains the groups that correspond to xs.
> 
> tapply(xs, g, sum)   # 659 671 687 701 612 622 629 646
> 


That is a fairly neat way of getting groups with a good 'approximate 
same size', however, in general I would like to be able to order my data 
in any way, and still cut it into equal 'size' groups (like quantiles 
for rows, but for row variable totals instead).

Seems it should be possible without an explicit loop (and some more 
'refinement' of the final group sizes), but I can't work it out.




> 
> On 3/17/06, Dan Bolser <dmb at mrc-dunn.cam.ac.uk> wrote:
> 
>>Dan Bolser wrote:
>>
>>>Hi,
>>>
>>>I have tuples of data in rows of a data.frame, each column is a variable
>>>for the 'items' (one per row).
>>>
>>>One of the variables is the 'size' of the item (row).
>>>
>>>I would like to cut my data.frame into groups such that each group has
>>>the same *total size*. So, assuming that we order by size, some groups
>>>should have several small items while other groups have a few large
>>>items. All the groups should have approximately the same total size.
>>>
>>>I have tried various combinations of cut, quantile, and ecdf, and I just
>>>can't work out how to do this!
>>>
>>>Any help is greatly appreciated!
>>>
>>>All the best,
>>>Dan.
>>>
>>
>>Perhaps there is a cleaver way, but I just wrote this in despiration...
>>
>>
>>my.groups <- 8
>>
>>my.total <-
>>  sum(my.res.1$TOT)   ## The 'size' variable in my data.frame
>>
>>my.approx.size <-
>>  my.total/
>>  my.groups
>>
>>my.j <- 1
>>my.roll <- 0
>>my.factor <- numeric()
>>
>>for(i in sort(my.res.1$TOT)){
>>
>>  my.roll <-
>>    my.roll + i
>>
>>  if (my.roll > my.approx.size * my.j)
>>    my.j <- my.j + 1
>>
>>  my.factor <-
>>    append(my.factor,my.j)
>>}
>>
>>my.factor <-
>>  as.factor(my.factor)
>>
>>
>>
>>Then...
>>
>> > tapply(my.factor,my.factor,length)
>>  1   2   3   4   5   6   7   8
>>152  62  45  34  25  21  14   8
>>
>>
>>And...
>>
>> > tapply(sort(my.res.1$TOT),my.factor,sum)
>>   1    2    3    4    5    6    7    8
>>2880 2848 2912 2893 2832 2906 2776 3029
>> >
>>
>>
>>
>>Which isn't bad.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>>______________________________________________
>>>R-help at stat.math.ethz.ch mailing list
>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>>
>>______________________________________________
>>R-help at stat.math.ethz.ch mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>>




More information about the R-help mailing list