# [R] Bucketing/Grouping Probabilities

Random Walker kinch1967 at gmail.com
Wed Nov 19 17:31:53 CET 2008

```
Many Thanks! It's a good start for me.

x <- c(0.049,  0.129,  0.043,  0.013,  0.015, 0.040,  0.066,
0.038,  0.2040, 0.0221, 0.234, 0.0443, 0.0684, 0.035)
.
.
.

This gives me

old     new
1  0.0490 0.04155
2  0.1290 0.12900
3  0.0430 0.04155
4  0.0130 0.01670
5  0.0150 0.01670
6  0.0400 0.04155
7  0.0660 0.06720
8  0.0380 0.04155
9  0.2040 0.21900
10 0.0221 0.01670
11 0.2340 0.21900
12 0.0443 0.04155
13 0.0684 0.06720
14 0.0350 0.04155

which looks pretty good.
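
# A hedged aside on reproducibility: kmeans() chooses random starting
# centres, so the buckets can differ from run to run. A minimal sketch,
# assuming a fixed seed and several restarts are acceptable (the seed value
# and nstart = 25 are arbitrary choices, not taken from this thread):

x <- c(0.049, 0.129, 0.043, 0.013, 0.015, 0.040, 0.066,
       0.038, 0.2040, 0.0221, 0.234, 0.0443, 0.0684, 0.035)
set.seed(1)                                 # arbitrary seed, for repeatability only
cl <- kmeans(x, centers = 5, nstart = 25)   # keep the best of 25 random starts
new <- as.vector(cl$centers)[cl$cluster]    # bucket mean assigned to each entrant
data.frame(old = x, new = new)

# Each centre is the mean of its bucket, so the bucketed values add up to
# the same total as the originals:
all.equal(sum(new), sum(x))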

I'm still wondering if there is some more 'correct' way of
clustering/bucketing probabilities - something with a more
entropy/likelihood bent? As stated in my first post, I'm too statistically
naive to know quite what I'm talking about here, but I would like to do as well
as I can with my estimates for the cluster/bucket probabilities.
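
# One possible reading of the "entropy/likelihood" idea, offered as a sketch
# rather than a definitive answer: score a bucketing by the Kullback-Leibler
# divergence between the original probabilities p and the bucketed ones q.
# For a fixed assignment of entrants to buckets, with q constant within each
# bucket and summing to 1, the q that minimises sum(p * log(p / q)) is the
# bucket mean of p, which is what the k-means centres already are.
# The helper name kl_bucket is hypothetical; the bucket labels below are
# read off the old/new table above.

kl_bucket <- function(p, bucket) {
  p <- p / sum(p)            # normalise so p is a proper distribution
  q <- ave(p, bucket)        # replace each p by the mean of its bucket
  sum(p * log(p / q))        # D(p || q) in nats; 0 means no information lost
}

p <- c(0.049, 0.129, 0.043, 0.013, 0.015, 0.040, 0.066,
       0.038, 0.2040, 0.0221, 0.234, 0.0443, 0.0684, 0.035)
bucket <- c(2, 4, 2, 1, 1, 2, 3, 2, 5, 1, 5, 2, 3, 2)
kl_bucket(p, bucket)

# Comparing this score across candidate bucketings (smaller is better) gives
# an alternative to the sum-of-squares criterion.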

Gabor Grothendieck wrote:
>
> Try this:
>
> x <- c(0.049,  0.129,  0.043,  0.013,  0.015, 0.040,  0.066,
> 0.038,  0.2040, 0.0221, 0.234, 0.0443, 0.0684, 0.035)
> cl <- kmeans(x, 5)
> cl
> newold <- with(cl, data.frame(old = x, new = centers[cluster]))
> newold
>
>
> On Wed, Nov 19, 2008 at 10:43 AM, Random Walker <kinch1967 at gmail.com>
> wrote:
>>
>> I have a list of entrants (1-14 in this example) in a competitive event
>> and
>> corresponding win probabilities for each entrant.
>>
>> [(1,  0.049), (2,  0.129), (3,  0.043), (4,  0.013), (5,  0.015), (6,
>> 0.040), (7,  0.066), (8,  0.038), (9,  0.204), (10, 0.022), (11, 0.234),
>> (12, 0.044), (13, 0.068), (14, 0.035)]
>>
>> So, of course Sum(ps) = 1.
>>
>> In order to make some subsequent computations more tractable, I wish to
>> cluster entrant win probabilities like so:
>>
>> [(1,  0.049), (2,  0.121), (3,  0.049), (4,  0.024), (5,  0.024), (6,
>> 0.049), (7,  0.072), (8,  0.049), (9,  0.185), (10, 0.024), (11, 0.185),
>> (12, 0.049), (13, 0.072), (14, 0.049)]
>>
>> viz. in this case I have 'bucketed' the entrant numbers against 5
>> representative probabilities and in subsequent computations will deem
>> (for
>> example) the win probability of 3 to be 0.049, so another way of
>> visualising
>> the result is:
>>
>> [((4, 5, 10), 0.024),
>>  ((3, 6, 8, 12, 14), 0.049),
>>  ((7, 13), 0.072),
>>  ((2), 0.121),
>>  ((9, 11), 0.185)]
>>
>> and (3 * 0.024) + (5 * 0.049) + (2 * 0.072) + (1 * 0.121) + (2 * 0.185)
>> ~= 1.
>>
>> My question is: What is the most 'correct' way to cluster these
>> probabilities? In my case the problem is not totally unconstrained. I
>> would
>> like to specify the number of buckets (probably will always wish to use
>> either 5 or 6), so I do not need an algorithm which determines the most
>> appropriate number of buckets given some cost function. I just need to
>> know
>> for a given number of buckets, which entrants go in which buckets and
>> what
>> is the representative probability for each bucket.
>>
>> The first thing which occurs to me is to sort probabilities into
>> ascending
>> order, generate all partitions of the list into (say) 5 buckets, and pick
>> the partition which minimises the sum of squared differences from the
>> mean
>> of each bucket summed over all buckets. If buckets were not associated
>> with
>> probabilities I would do this without a second thought... but I wonder if
>> this is the right thing to do here? I'm too statistically naive to know
>> one
>> way or the other.
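
# The exhaustive search described above is small enough to run directly:
# with 14 sorted values and 5 buckets there are choose(13, 4) = 715 ways to
# place the cut points. A minimal sketch, assuming buckets must be contiguous
# in sorted order, k >= 2, and partitions are scored by within-bucket sum of
# squares. The function name best_partition is hypothetical.

best_partition <- function(p, k) {
  o <- order(p)
  s <- p[o]                               # probabilities in ascending order
  n <- length(s)
  cuts <- combn(n - 1, k - 1)             # each column: positions after which to cut
  bucket_ids <- function(cc) findInterval(seq_len(n), cc + 1) + 1
  wss <- apply(cuts, 2, function(cc) {
    b <- bucket_ids(cc)
    sum((s - ave(s, b))^2)                # squared deviations from bucket means
  })
  b <- bucket_ids(cuts[, which.min(wss)]) # best cut points found
  bucket <- integer(n)
  bucket[o] <- b                          # map bucket ids back to entrant order
  data.frame(entrant = seq_len(n), old = p, bucket = bucket,
             new = ave(p, bucket))
}

x <- c(0.049, 0.129, 0.043, 0.013, 0.015, 0.040, 0.066,
       0.038, 0.2040, 0.0221, 0.234, 0.0443, 0.0684, 0.035)
res <- best_partition(x, 5)
res

# In one dimension the partition that minimises within-bucket sum of squares
# is always contiguous in sorted order, so this search targets the same
# optimum that kmeans() approximates from random starts.
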
>>
>> I would appreciate any suggestions re correct approach and also
>> (obviously)
>>
>> Many thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Bucketing-Grouping-Probabilities-tp20582544p20582544.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>