[Rd] Sampling with unequal probabilities

Tim Hesterberg timh at insightful.com
Thu Feb 7 20:39:57 CET 2008


This is in followup to a thread on R-help with subject "Sampling".
I claim that R does the wrong thing by default when
sampling with unequal probabilities without replacement -
the selection probabilities are not proportional to 'prob',
for any draw after the first: I suggest that R do what S-PLUS now does
(though you're free to choose a better implementation).

What S-PLUS now does is:
If 'replace==TRUE' then sample with replacement.
Otherwise, sample without replacement or with minimal replacement,
according to the value of argument 'minimal'.
The default is 'minimal = length(prob) > 1'.
One can specify 'minimal = FALSE' for backward compatibility.

In the case of sampling with minimal replacement, duplicates
may occur whenever 'max(size*prob) > 1', and are guaranteed if
'max(size*prob) >= 2'.  You can think of drawing
'trunc(size*prob)' observations deterministically,
then drawing the remaining 'size - sum(trunc(size*prob))' observations
without replacement, with an adjusted prob vector.

The algorithm I used is relatively simple.  It is one of the
Brewer and Hanif algorithms (though I don't recall if they used
the final random shuffle).  Here's one description, or you
may prefer the description in
Pedro J. Saavedra (2005)
Comparison of Two Weighting Schemes for Sampling with Minimal Replacement 
http://www.amstat.org/Sections/Srms/Proceedings/y2005/Files/JSM2005-000882.pdf

In the case of minimal = TRUE (sampling with minimal replacement),
with unequal probabilities:
* scale prob to sum to 1,
* randomly sort the observations along with prob
* let cprob = cumsum(prob), 
* draw a systematic sample of size 'size' in (0,1):
  uniformVector <- (1:size - runif(1))/size
* observation i is selected if cprob[i-1] < uniformVector[j] <= cprob[i]
  for any j
  In the case (size*max(prob) > 1), the number of times the observation
  is selected is the number of j's for which the inequalities hold.
* the selected observations are randomly sorted again.

Tim Hesterberg
Disclaimer - these are my opinions, not Insightful's.



More information about the R-devel mailing list