[R] OT: What distribution is this?

Sun Sep 26 12:17:11 CEST 2010

On 09/26/2010 10:29 AM, Rainer M Krug wrote:
> Hi Peter, H Berwin,
> 
> thanks a lot for your clarifications, it makes more sense now. But
> having your input and thinking a little bit more about the problem, I
> realized that I am simply interested in the pdf p(y) that y *number* of
> entities (which ones is irrelevant) in N are *not* drawn after the
> sampling process has been completed. Even simpler (I guess), in a first
> step, I would only need the mean number of expected non-drawn entities
> in N (pMean).
> 
> The range of my values:
> N is in the range of 1 --- 100 000
> x is in the range of 10 --- 40 000 000
> n is in the range of 20
> 
> I guess in cases where n*x is substantially smaller than N, I could
> simply use a binomial distribution for n*x samples to approximate it --
> right? 
> For cases where n*x is substantially bigger than N, I can safely
> (especially in the context of my simulation) assume that all entities in
> N are drawn at least once.
> 
> But what about the range in between? 

As long as you are only looking for the mean, I think it is easy: The
probability that a particular entity is not sampled in x trials is
((N-n)/N)^x and the mean number of such entities is just N times as
much. The variance is slightly harder, but not excessively so (read: I
know that you start by working out the probabilities in the 2x2 tables
of the joint distribution of two indicators for an entity never being
sampled, use this to get the covariance, etc., I just haven't actually
done it.)
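
A minimal sketch of the calculation described above, in Python. It
assumes x independent trials, each drawing n distinct entities uniformly
from N; that scheme is read off the (N-n)/N term and is an assumption,
not something confirmed here. Under the same assumption, the probability
that two given entities are both never sampled is
((N-n)(N-n-1)/(N(N-1)))^x, which gives the covariance term. The function
name never_sampled_moments is made up for illustration.

```python
def never_sampled_moments(N, n, x):
    """Mean and variance of the number of entities never drawn.

    Assumes x independent trials, each drawing n distinct entities
    uniformly at random from N (an assumption, see the text above).
    """
    if n >= N:
        # Every entity is drawn in every trial.
        return 0.0, 0.0

    # P(a given entity is missed in one trial) = (N-n)/N
    p1 = (N - n) / N
    # P(two given entities are both missed in one trial);
    # with N == 1 there are no pairs, so the value is irrelevant.
    p2 = (N - n) * (N - n - 1) / (N * (N - 1)) if N > 1 else 0.0

    # The x trials are independent, so the probabilities multiply.
    q1 = p1 ** x
    q2 = p2 ** x

    mean = N * q1
    # Sum of N indicator variances plus N(N-1) pairwise covariances.
    var = N * q1 * (1.0 - q1) + N * (N - 1) * (q2 - q1 * q1)
    # Guard against tiny negative values from floating-point cancellation.
    return mean, max(var, 0.0)


if __name__ == "__main__":
    mean, var = never_sampled_moments(N=1000, n=20, x=100)
    print(mean, var)
```

The covariance between two indicators is negative here, so the variance
comes out smaller than the binomial value N*q1*(1-q1) that would apply
if the entities were independent. For values of x far beyond N/n both
powers underflow to zero, which matches the expectation that every
entity is then drawn at least once.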

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com