[R] OT: What distribution is this?

Sun Sep 26 14:14:00 CEST 2010

G'day Rainer,

On Sun, 26 Sep 2010 10:29:08 +0200
Rainer M Krug <r.m.krug at gmail.com> wrote:

> I realized that I am simply interested in the pdf p(y) that y
> *number* of entities (which ones is irrelevant) in N are are *not*
> drawn after the sampling process has been completed. Even simpler (I
> guess), in a first step, I would only need the mean number of
> expected non-drawn entities in N (pMean).
> 
> But what about the range in between?

You can calculate those probilities.

Your problem was sounding vaguely familiar to me yesterday but I could
not put my finger onto its abstract counterpart.  Now with this
clarification I can. :)  Your set up is exactly the one of lottery
draws.  In each draw n of N numbers are drawn without replacement.  So
your question is "what is the probability that I have not seen k of the
N numbers after x draws?". 

This question can easily be answered by using some Markov Chain
theory.  Let Y_l be the number of entities that you have not seen after
the l-th draw, taking possible values 0, 1, ...., N.  Obviously, Y_0=N
with probability 1, and Y_1=N-n with probability 1.  Now, the
probability that Y_l, l>=1, is equal to k is given by;

     0 if k=N-n+1, N-n+2,..., N 
and  sum_{i=0,...,n} P[Y_l=k | Y_{l-1} = k+i] P[Y_{l-1} = k+i] o/w.

P[Y_l=k | Y_{l-1} = k+i] is the probability that the n entities sampled
in the l-th draw consists of i entities of the k+i entities that were
not seen by draw l-1 and n-i entities of the N-k-i entities that were
already seen by draw l-1. This probability is

    choose(k+i, i)*choose(N-k-i, n-i) / choose(N, n)

Note that this probability is independent of l, i.e., Y_0, Y_1, Y_2,...
form a stationary Markov Chain.  

Thus, to answer your question, the strategy would be to calculate the
transition matrix once and then start with either the state vector of
Y_0 or of Y_1 (both of which are rather trivial as mentioned above) and
calculate the state vector of Y_x by repeatedly multiplying the chosen
initial state vector with the transition matrix for the appropriate
number of times.

The transition matrix is (N+1)x(N+1), so it can be rather large, but
the number of non-zero entries in the transition matrix is at most
(N+1)*(n+1), so depending on the relative magnitude of n and N, sparse
matrix techniques might be helpful (or using a language such as C,
FOTRAN, ... for the calculations).

HTH.

Cheers,

	Berwin