[R] Sum of Bernoullis with varying probabilities

Sat Oct 7 23:03:31 CEST 2006

Hi again.
I had suspected that doing the calculation by a convolution
method might be both straightforward and efficient in R.

I've now located convolve() (in 'base'!!), and have a solution
using this function. Details below. Original statement of
problem, and method originally proposed, repeated below for
reference.

On 10/6/06, Ted Harding <Ted.Harding at nessie.mcc.ac.uk> wrote:
> Hi Folks,
>
> Given a series of n independent Bernoulli trials with
> outcomes Yi (i=1...n) and Prob[Yi = 1] = Pi, I want
>
>   P = Prob[sum(Yi) = r]  (r = 0,1,...,n)
>
> I can certainly find a way to do it:
>
> Let p be the vector c(P1,P2,...,Pn).
> The cases r=0 and r=n are trivial (and also are exceptions
> for the following routine).
>
> For a given value of r in (1:(n-1)),
>
>   library(combinat)
>   Set <- (1:n)
>   Subsets <- combn(Set,r)
>   P <- 0
>   for(i in (1:dim(Subsets)[2])){
>     ix <- numeric(n)
>     ix[Subsets[,i]] <- 1
>     P <- P + prod((p^ix) * ((1-p)^(1-ix)))
>   }

The convolution method implemented in convolve computes, for
two series X[1],X[2],...,X[n] and Y[1],Y[2],...,Y[m] the
series (k = 1, 2, ... , n+m-1):

  Z[k] = sum( X[k-m+i]*Y[i] )

over valid values of i, i.e. products of terms of X matched
with terms of a shifted Y, using the "open" method (see
?convolve).

This is not quite what is wanted for the type of convolution
needed to compute the distribution of the sum of two integer
random variables, since one needs to reverse one of the series
being convolved. This is the basis of the following:

Given a vector p of probabilities P1, P2, P3, for "Y=1" in
successive trials, I need to convolve the successive Bernoulli
distributions (1-P1,P1), (1-P2,P2), ... , (1-Pn,Pn). Hence:

  ps<-cbind(1-p,p); u<-ps[1,]; u1<-u
  for(i in (2:16)){
    u<-convolve(u,rev(ps[i,]),type="o")
  }

In the case I was looking at, I had already obtained the vector
P of probabilities P(Sum(Y)=0), P(sum(Y)=1), ... by the method
previously described (repeated above); and here n=16.

Having obtained u by the above routine, I can now compare the
results, u with P:

cbind((0:16),u,P)
   r             u            P
   0  1.353353e-01 1.353353e-01
   1  3.007903e-01 3.007903e-01
   2  2.976007e-01 2.976007e-01
   3  1.747074e-01 1.747074e-01
   4  6.826971e-02 6.826971e-02
   5  1.884969e-02 1.884969e-02
   6  3.804371e-03 3.804371e-03
   7  5.720398e-04 5.720398e-04
   8  6.463945e-05 6.463945e-05
   9  5.489454e-06 5.489454e-06
  10  3.473997e-07 3.473997e-07
  11  1.607822e-08 1.607822e-08
  12  5.262533e-10 5.262532e-10
  13  1.148626e-11 1.148618e-11
  14  1.494650e-13 1.493761e-13
  15  9.008700e-16 8.764298e-16
  16 -1.896716e-17 1.313261e-19

so, apart from ignorable differences in the very small
probabilities for r>11, they agree. I have to admit, though,
that I had to tread carefully (and experiment with Binomials)
to make sure of exactly how to introduce the reversal
(at "rev(ps[i,])").

And the convolution method is *very* much faster than the
selection of all subsets method!

However, I wonder whether the "subsets" method may be the more
accurate of the two (not that it really matters).

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 07-Oct-06                                       Time: 22:03:27
------------------------------ XFMail ------------------------------