[R] faster computation of cumulative multinomial distribution

Sat Mar 31 01:59:47 CEST 2007

Alberto Monteiro wrote:
> Theo Borm wrote:
>   
>> I have a series of /unequal/ probabilities [p1,p2,...,pk], describing
>> mutually exclusive events, and a "remainder" class with a probability
>> p0=1-p1-p2-....-pk, and need to calculate, for a given number of trials
>> t>=k, the combined probability that each of the classes 1...k 
>> contains at least 1 "event" (the remainder class may be empty).
>>
>> To me this reaks of a sum of multinomial distributions, and indeed, I
>> can readily calculate the correct answer for small figures t,k using 
>> a small R program.
>>
>>     
> The multinomial distribution is a sum of independent cathegorial
> distributions, so there's no such thing as a sum of multinomial
> distributions - unless the sum of multinomial distributions is
> also a multinomial distribution.
>
> Yikes. I think I am saying too much.
>
> Just look here:
> http://en.wikipedia.org/wiki/Multinomial_distribution
>   
*sorry. that should have read "sum of multinomial probabilities"

The following article (first page freely available):

A Representation for Multinomial Cumulative Distribution Functions*
Bruce Levin
/The Annals of Statistics/, Vol. 9, No. 5 (Sep., 1981), pp. 1123-1126

describes more or less what I mean, only instead of calculating

pN=pN(t,p1,p2,...,pt,a1,a2,...at) = P(n1<=a1, n2<=a2, n3<=a3, .... , nt<=at)

I need to calculate
pN= P(n1>=0, n2>=1, n3>=1, .... , nt>=1)
with n1+n2+n3+...+nt=total number of trials

(Note that for notational purposes I renamed my index [0...k] to Levin's
[1...t], hence category 1 represents the "remainder" class, and my
"number of trials" t becomes Levin's sum of a1+a2+a3+...+at  )

Also note  the change from "<=" to ">="
>
> But what do you want to calculate? The probabilities?
>   
I want to calculate single (compound) probability, given t mutually
exclusive and exhaustive events and k trials that the first "slot"
contains /at least/ zero events AND all the other "slots" contain at
least one event.

Think of a sort of "power roulette", played with 58 balls
simultaneously, with a wheel containing 36 red/black slots of unequal
size, and 1 green slot. I need to calculate the probability that each of
the 36 red/black slots contains at least one ball.
> I don't know what you want to do, but maybe this is the case
> where a Monte Carlo Simulation would give a fast and accurate
> result.
>   
hmmm....

I have a hunch that it won't work as my p vector typically contains
values like:

p<-c(0.99, 0.005, 0.003, 0.001, 0.0005, 0.0003, 0.0001, 0.00005,
0.00003, 0.00002).

keeping in mind that the first "green" slot (0.99) represents my
"remainder" class (which may be empty), and calculating the probability
that with 9 trials I get exactly one in each of the other  slots:

p=0.005*0.003*0.001*0.0005*0.0003*0.0001*0.00005*0.00003*0.00002*9!

=~ 2.45*10^-27

so, I'm looking for very rare events.

to calculate this figure even marginally /accurately/ with a monte carlo
simulation would require in well excess of 4*10^26 iterations.....

> Let's say you want to estimate some function of the outcome 
> of this multinomial. Just to fix ideas:
>
> # p - the probabilities
> p <- c(1, 2, 3, 2, 4, 4, 2, 1, 2, 3, 1, 2, 3, 4, 3, 2, 1, 2) 
> # normalize so that sum(p) = 1
> p <- p / sum(p)
> # get k
> k <- length(p)
> # number of trials, say 100
> t <- 100
> # x is the simulation for each trial. x[1] is the number of
> # times we got the thing with probability p[1], etc
> #
> # x <- rmultinom(1, k, p) # simulates 1 experiment
> #
> # f is the function that you will use to "evaluate" x
> #
> f <- function(x) {
>   return(sum(x * 1.05^-(0:(length(x)-1))))  # anything can come here
> }
> #
> # A Monte Carlo simulation will generate "a lot" of xs and
> # get a distribution of f(x)
> #
> # Simulate 10000 xs
> x <- rmultinom(10000, k, p) # takes almost zero time
> # Evaluate 10000 f(x). apply(matrix, 1 (rows) or 2 (coluns), function)
> f.dist <- apply(x, 2, f)
> # analyse it
> hist(f.dist) # etc
>
>   
>> R has dbinom /and/ pbinom functions, but unfortunately only a dmultinom
>> and no pmultinom function... perhaps because there is no (known) 
>> faster way?
>>
>>     
> There's a rmultinom
>   

Thanks,

with kind regards,

Theo Borm.