# [R] Bucketing/Grouping Probabilities

Random Walker kinch1967 at gmail.com
Wed Nov 19 16:43:57 CET 2008

```
I have a list of entrants (1-14 in this example) in a competitive event and
corresponding win probabilities for each entrant.

[(1,  0.049), (2,  0.129), (3,  0.043), (4,  0.013), (5,  0.015), (6,
0.040), (7,  0.066), (8,  0.038), (9,  0.204), (10, 0.022), (11, 0.234),
(12, 0.044), (13, 0.068), (14, 0.035)]

So, of course Sum(ps) = 1.

In order to make some subsequent computations more tractable, I wish to
cluster entrant win probabilities like so:

[(1,  0.049), (2,  0.121), (3,  0.049), (4,  0.024), (5,  0.024), (6,
0.049), (7,  0.072), (8,  0.049), (9,  0.185), (10, 0.024), (11, 0.185),
(12, 0.049), (13, 0.072), (14, 0.049)]

viz. in this case I have 'bucketed' the entrant numbers against 5
representative probabilities and in subsequent computations will deem (for
example) the win probability of 3 to be 0.049, so another way of visualising
the result is:

[((4, 5, 10), 0.024),
((1, 3, 6, 8, 12, 14), 0.049),
((7, 13), 0.072),
((2), 0.121),
((9, 11), 0.185)]

and (3 * 0.024) + (6 * 0.049) + (2 * 0.072) + (1 * 0.121) + (2 * 0.185) ~=
1.

My question is: What is the most 'correct' way to cluster these
probabilities? In my case the problem is not totally unconstrained. I would
like to specify the number of buckets (probably will always wish to use
either 5 or 6), so I do not need an algorithm which determines the most
appropriate number of buckets given some cost function. I just need to know
for a given number of buckets, which entrants go in which buckets and what
is the representative probability for each bucket.

The first thing which occurs to me is to sort probabilities into ascending
order, generate all partitions of the list into (say) 5 buckets, and pick
the partition which minimises the sum of squared differences from the mean
of each bucket summed over all buckets. If buckets were not associated with
probabilities I would do this without a second thought... but I wonder if
this is the right thing to do here? I'm too statistically naive to know one
way or the other.
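
# Illustrative sketch only (not part of the original post): a brute-force
# version of the approach described above, written in Python because the
# data above is already in Python-style list notation. The names sse and
# bucket_probabilities, and the use of the bucket mean as the representative
# probability, are assumptions made for illustration. Whether squared error
# is the 'correct' criterion for probabilities remains the open question.
from itertools import combinations

def sse(values):
    """Sum of squared deviations of values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def bucket_probabilities(entrants, k):
    """Split (entrant, p) pairs into k contiguous buckets of the sorted
    list, minimising the total within-bucket sum of squares."""
    ordered = sorted(entrants, key=lambda ep: ep[1])
    n = len(ordered)
    best_cost, best_bounds = None, None
    # Choosing k-1 cut points among the n-1 gaps enumerates every partition
    # of the sorted list into k non-empty contiguous buckets.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        cost = sum(sse([p for _, p in ordered[a:b]])
                   for a, b in zip(bounds, bounds[1:]))
        if best_cost is None or cost < best_cost:
            best_cost, best_bounds = cost, bounds
    buckets = []
    for a, b in zip(best_bounds, best_bounds[1:]):
        members = ordered[a:b]
        mean_p = sum(p for _, p in members) / len(members)
        buckets.append((tuple(e for e, _ in members), mean_p))
    return buckets

ps = [(1, 0.049), (2, 0.129), (3, 0.043), (4, 0.013), (5, 0.015),
      (6, 0.040), (7, 0.066), (8, 0.038), (9, 0.204), (10, 0.022),
      (11, 0.234), (12, 0.044), (13, 0.068), (14, 0.035)]

# With 14 entrants and 5 buckets there are only C(13, 4) = 715 candidate
# partitions, so exhaustive search is cheap. Using the bucket mean as the
# representative keeps the bucketed probabilities summing to Sum(ps).
for members, p in bucket_probabilities(ps, 5):
    print(members, round(p, 4))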

I would appreciate any suggestions re correct approach and also (obviously)