[R] OT: What distribution is this?

Peter Dalgaard pdalgd at gmail.com
Sat Sep 25 17:19:03 CEST 2010


On 09/25/2010 04:24 PM, Rainer M Krug wrote:
> Hi
> 
> This is OT, but I need it for my simulation in R.
> 
> I have a special case for sampling with replacement: instead of sampling
> once and replacing it immediately, I sample n times, and then replace all n
> items.
> 
> 
> So:
> 
> N entities
> x samples with replacement
> each sample consists of n sub-samples WITHOUT replacement, which are all
> replaced before the next sample is drawn
> 
> My question is: which distribution can I use to describe how often each
> entity of the N has been sampled?
> 
> Thanks for your help,
> 
> Rainer
> 

How did you know I was in the middle of preparing lectures on the
variance of the hypergeometric distribution and such? ;-)

If you look at a single item, the answer is of course that you have a
binomial with size=x and prob=n/N. The problem is that these binomials
are correlated between items.
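
A quick way to convince yourself of the single-item claim is to simulate it. The following is only a sketch: the values of N, n, x and the number of replications are made up for illustration, and count_item1 is just a name chosen here. It compares the simulated mean and variance of the count for item 1 with those of a binomial with size=x and prob=n/N.

```r
set.seed(1)
N <- 10; n <- 3; x <- 20   # illustrative values, not from the original question
reps <- 5000               # number of simulated runs (arbitrary)

# One run = x samples, each of n items drawn WITHOUT replacement from 1..N.
# For each run, count how many of the x samples contain item 1.
count_item1 <- replicate(reps, sum(replicate(x, 1 %in% sample.int(N, n))))

emp_mean <- mean(count_item1)
emp_var  <- var(count_item1)

# Simulated vs. binomial(size = x, prob = n/N) mean and variance
rbind(mean = c(simulated = emp_mean, binomial = x * n / N),
      var  = c(simulated = emp_var,  binomial = x * (n / N) * (1 - n / N)))
```

This only checks the marginal distribution of one item; it says nothing about the dependence between items.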

If you can make do with a 2nd order approximation, then the covariance
between the indicators for two items being selected is easily found from
symmetry and the fact that if you sum all N indicators you get the
constant n, so the sum has zero variance. I.e., with p = n/N, the
variance of each indicator is p(1-p) and the covariance between two of
them is -p(1-p)/(N-1). For sums over repeated samples, just multiply
everything by the number x of samples.
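
For small N the variance and covariance of the indicators can be checked exactly by enumerating all equally likely n-subsets. This is a sketch with made-up values of N and n; the names ind, centered and V are chosen here for illustration.

```r
N <- 6; n <- 2             # small illustrative values so enumeration is cheap
p <- n / N

subsets <- combn(N, n)     # each column is one equally likely n-sample

# ind[s, i] is 1 if item i belongs to subset s, else 0
ind <- t(apply(subsets, 2, function(s) as.numeric(seq_len(N) %in% s)))

# Population (not sample) covariance matrix: every subset is enumerated,
# so divide by the number of subsets rather than by that number minus one.
centered <- sweep(ind, 2, colMeans(ind))
V <- crossprod(centered) / nrow(ind)

c(variance   = V[1, 1], formula = p * (1 - p))
c(covariance = V[1, 2], formula = -p * (1 - p) / (N - 1))
```

For x independent repetitions, the covariance matrix of the per-item counts would then be x * V.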

If you intend to just count the frequency of a particular feature in
each of your n-samples, i.e., you have x replications of a
hypergeometric experiment, then you can do somewhat better by computing
the explicit convolution of x hypergeometrics (convolve(x, rev(y),
type="o") and Reduce() are your friends; in that call x and y stand for
two probability vectors, not the number of samples). I'm not sure this
is actually worth the trouble, but it should be doable for decent-sized
N and x.
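
A minimal sketch of that convolution, assuming K of the N items carry the feature (K, the numeric values and the helper conv are assumptions made here for illustration):

```r
N <- 20; K <- 5; n <- 4; x <- 3   # illustrative values; K = items with the feature

# pmf of the feature count in a single n-sample, on support 0..n
pmf1 <- dhyper(0:n, K, N - K, n)

# Ordinary convolution of two probability vectors
conv <- function(a, b) convolve(a, rev(b), type = "open")

# Convolve x copies of the single-sample pmf; support becomes 0..(x*n)
pmf_total <- Reduce(conv, rep(list(pmf1), x))

# convolve() works via the FFT, so clip tiny negative round-off values
pmf_total <- pmax(pmf_total, 0)

sum(pmf_total)                    # should be very close to 1
sum(0:(x * n) * pmf_total)        # mean; should be close to x * n * K / N
```

Since convolve() uses the FFT, very small tail probabilities are subject to round-off; if those matter, a direct summation might be preferable.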



-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


