[R] Bucketing/Grouping Probabilities

Wed Nov 19 17:10:04 CET 2008

Try kmeans, using each cluster center as the representative probability
of its bucket:

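# win probabilities for entrants 1..14
x <- c(0.049,  0.129,  0.043,  0.013,  0.015, 0.040,  0.066,
0.038,  0.2040, 0.0221, 0.234, 0.0443, 0.0684, 0.035)
cl <- kmeans(x, 5)  # random starts, so the result can vary between runs
cl
# map each probability to the center of its cluster
newold <- with(cl, data.frame(old = x, new = centers[cluster]))
newold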
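
The bucket-by-bucket view asked for in the question can be sketched along
these lines. This is only an illustration: the names p, ord, buckets and
total are made up here, the probabilities are the ones quoted in the
question, and the seed and nstart = 25 are arbitrary choices to make the
run repeatable and less dependent on the random starts.

```r
# Probabilities from the quoted question, indexed by entrant number 1..14.
p <- c(0.049, 0.129, 0.043, 0.013, 0.015, 0.040, 0.066,
       0.038, 0.204, 0.022, 0.234, 0.044, 0.068, 0.035)

set.seed(1)                       # arbitrary seed, only for repeatability
cl <- kmeans(p, centers = 5, nstart = 25)

# Entrant numbers in each bucket, ordered by representative probability.
ord <- order(cl$centers[, 1])
buckets <- split(seq_along(p), factor(cl$cluster, levels = ord))
names(buckets) <- round(cl$centers[ord, 1], 3)
buckets

# Each center is the mean of its bucket, so the centers weighted by
# bucket size add up to the original total.
total <- sum(cl$size * cl$centers[, 1])
all.equal(total, sum(p))
```

Because each representative value is a bucket mean, the bucketed
probabilities keep the same sum as the originals.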

On Wed, Nov 19, 2008 at 10:43 AM, Random Walker <kinch1967 at gmail.com> wrote:
>
> I have a list of entrants (1-14 in this example) in a competitive event and
> corresponding win probabilities for each entrant.
>
> [(1,  0.049), (2,  0.129), (3,  0.043), (4,  0.013), (5,  0.015), (6,
> 0.040), (7,  0.066), (8,  0.038), (9,  0.204), (10, 0.022), (11, 0.234),
> (12, 0.044), (13, 0.068), (14, 0.035)]
>
> So, of course Sum(ps) = 1.
>
> In order to make some subsequent computations more tractable, I wish to
> cluster entrant win probabilities like so:
>
> [(1,  0.049), (2,  0.121), (3,  0.049), (4,  0.024), (5,  0.024), (6,
> 0.049), (7,  0.072), (8,  0.049), (9,  0.185), (10, 0.024), (11, 0.185),
> (12, 0.049), (13, 0.072), (14, 0.049)]
>
> viz. in this case I have 'bucketed' the entrant numbers against 5
> representative probabilities and in subsequent computations will deem (for
> example) the win probability of 3 to be 0.049, so another way of visualising
> the result is:
>
> [((4, 5, 10), 0.024),
>  ((1, 3, 6, 8, 12, 14), 0.049),
>  ((7, 13), 0.072),
>  ((2), 0.121),
>  ((9, 11), 0.185)]
>
> and (3 * 0.024) + (6 * 0.049) + (2 * 0.072) + (1 * 0.121) + (2 * 0.185) ~=
> 1.
>
> My question is: What is the most 'correct' way to cluster these
> probabilities? In my case the problem is not totally unconstrained. I would
> like to specify the number of buckets (probably will always wish to use
> either 5 or 6), so I do not need an algorithm which determines the most
> appropriate number of buckets given some cost function. I just need to know
> for a given number of buckets, which entrants go in which buckets and what
> is the representative probability for each bucket.
>
> The first thing which occurs to me is to sort probabilities into ascending
> order, generate all partitions of the list into (say) 5 buckets, and pick
> the partition which minimises the sum of squared differences from the mean
> of each bucket summed over all buckets. If buckets were not associated with
> probabilities I would do this without a second thought... but I wonder if
> this is the right thing to do here? I'm too statistically naive to know one
> way or the other.
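
With only 14 values the exhaustive search described above is cheap. For
one-dimensional data the least-squares partition always consists of
contiguous runs of the sorted values, so it is enough to try every way of
placing k - 1 breaks between neighbours. A rough sketch, not a canned
function: the names wss, cuts, cost, best and result are made up for this
example, and the data are the probabilities quoted in the question.

```r
p <- c(0.049, 0.129, 0.043, 0.013, 0.015, 0.040, 0.066,
       0.038, 0.204, 0.022, 0.234, 0.044, 0.068, 0.035)
k <- 5                      # number of buckets wanted

o <- order(p)               # entrant numbers in ascending order of probability
s <- p[o]                   # sorted probabilities
n <- length(s)

# Sum of squared deviations from the bucket mean.
wss <- function(v) sum((v - mean(v))^2)

# Each column of cuts is one choice of k - 1 break positions; a break at i
# separates s[i] from s[i + 1].
cuts <- combn(n - 1, k - 1)
cost <- apply(cuts, 2, function(b) {
  g <- findInterval(seq_len(n), b + 1) + 1   # bucket label per sorted value
  sum(tapply(s, g, wss))
})

# Rebuild the bucket labels for the cheapest set of breaks.
best <- cuts[, which.min(cost)]
g <- findInterval(seq_len(n), best + 1) + 1

# Representative probability = bucket mean, reported per entrant.
result <- data.frame(entrant = o, old = s, new = ave(s, g))
result[order(result$entrant), ]
```

This minimises the same criterion that kmeans works on, but it checks all
choose(13, 4) = 715 contiguous partitions instead of relying on random
starts, so it cannot get stuck in a local optimum.
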
>
> I would appreciate any suggestions re correct approach and also (obviously)
> any tips on how one might go about this in R using canned functions.
>
> Many thanks!