[R] Resample with replacement to produce many rarefaction curves with same number of samples

Thu Sep 8 18:07:14 CEST 2016

> On 8 Sep 2016, at 16:25, David L Carlson <dcarlson at tamu.edu> wrote:
> 
> Sampling without replacement treats the sample as the population for the purposes of estimating the outcomes at smaller sample sizes. Sampling with replacement (the same as bootstrapping) treats the sample as one possible outcome of a larger population at that sample size. 

But the resamples aren't actually independent samples from the underlying population. In contrast to the usual applications of bootstrapping, they don't even approximate independent samples well when the quantity of interest is the type ("species") count.

In my understanding, which may be incomplete, bootstrapping works for a test statistic computed from the measurements of a single numeric random variable (or perhaps several r.v.) in an i.i.d. sample. The type count cannot be expressed as such a test statistic: a resample can only contain types that were already observed, and it will usually miss some of the rare ones. Hence the underestimation bias from sampling with replacement.
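
To make the bias concrete, here is a minimal Python sketch; it is not part of the original discussion. The Zipf-like population, the sample size and all function names are assumptions chosen for illustration only. It draws one sample, resamples it with replacement at the same size, and compares the observed type count with the bootstrap type counts. It also computes the exact expected type count of a resample, sum over types of 1 - (1 - f/N)^N, where f is the frequency of the type in the sample and N is the sample size. For a type seen only once this probability is 1 - (1 - 1/N)^N, roughly 0.63 for large N, so about a third of the singletons drop out of every resample.

```python
import random
from collections import Counter


def make_sample(n_tokens, n_population_types, rng):
    """Draw tokens from a hypothetical Zipf-like population (weight of type k is 1/k)."""
    types = list(range(1, n_population_types + 1))
    weights = [1.0 / k for k in types]
    return rng.choices(types, weights=weights, k=n_tokens)


def bootstrap_type_counts(tokens, n_resamples, rng):
    """Type counts of resamples drawn with replacement at the original sample size."""
    n = len(tokens)
    return [len(set(rng.choices(tokens, k=n))) for _ in range(n_resamples)]


def expected_bootstrap_types(freqs):
    """Exact expected type count of one resample, given the sample's type frequencies."""
    n = sum(freqs)
    return sum(1.0 - (1.0 - f / n) ** n for f in freqs)


rng = random.Random(42)  # fixed seed so the run is reproducible
tokens = make_sample(n_tokens=2000, n_population_types=10000, rng=rng)

observed = len(set(tokens))
freqs = list(Counter(tokens).values())
expected = expected_bootstrap_types(freqs)
counts = bootstrap_type_counts(tokens, n_resamples=200, rng=rng)
mean_boot = sum(counts) / len(counts)

print("observed types in the sample:     ", observed)
print("expected types in a resample:     ", round(expected, 1))
print("mean types over 200 resamples:    ", round(mean_boot, 1))
print("largest type count in any resample:", max(counts))
```

No resample can exceed the observed type count, so the bootstrap distribution lies entirely at or below the value it is meant to scatter around. Sampling without replacement at smaller sizes (ordinary rarefaction) does not have this problem, but it only interpolates below the observed sample size.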

In NLP, we often use parametric power-law models of the population in order to extrapolate type counts beyond the observed sample size (e.g. with the zipfR package, http://zipfr.r-forge.r-project.org), but this implies strong (and often inappropriate) assumptions about the population.

Best,
Stefan