[Rd] sample.int() algorithms

Nathan Kurz nate at verse.com
Wed Sep 9 23:57:37 CEST 2015

```I was experiencing a strange pattern of slowdowns when using
sample.int(), where sampling from a one  population would sometimes
take 1000x longer than taking the same number of samples from a
slightly larger population.   For my application, this resulted in a
runtime of several hours rather than a few seconds.  Looking into it,
I saw that sample.int() is hardcoded to switch algorithms when the
population is larger than 1e+7, and I was straddling this boundary:

sample.int  <- function(n, size = n, replace = FALSE, prob = NULL)
{
if (!replace && is.null(prob) && n > 1e7 && size <= n/2)
.Internal(sample2(n, size))
else .Internal(sample(n, size, replace, prob))
}

do_sample2() takes the approach of taking a sample, and then checking
if this sample is a duplicate.  As long as the population size is much
larger than numbers of samples, this will be efficient.  This explains
the check for "size <= n/2".   But I'm not sure why the "n > 1e7"
check is needed.

I put together some sample code to show the difference in timing
letting sample.int() choose the cutoff point versus manually
specifying the use of do_sample2():

###  compare times for sample.int() vs internal function sample2()
compareSampleTimes = function(popSizeList=c(1e5, 1e6, 1e7, 1e8, 1e9),
sampleSizeList=c(10, 100, 1000, 10000),
numReplications=1000) {
for (sampleSize in sampleSizeList) {
for (popSize in popSizeList)  {
elapsed1 = system.time(replicate(numReplications,
sample.int(popSize, sampleSize)))[["elapsed"]]
elapsed2 = system.time(replicate(numReplications,
.Internal(sample2(popSize, sampleSize))))[["elapsed"]]
cat(sprintf("Sample %d from %.0e: %f vs %f seconds\n",
sampleSize, popSize, elapsed1, elapsed2))
}
cat("\n")
}
}

compareSampleTimes()

https://gist.github.com/nkurz/8fa6ff3772a054294531

And here's the output showing the 1000x slowdowns at population sizes
of 10e7 under R-3.2.2:

\$ Rscript compareSampleTimes.R
Sample 10 from 1e+05: 0.133000 vs 0.003000 seconds
Sample 10 from 1e+06: 0.931000 vs 0.003000 seconds
Sample 10 from 1e+07: 13.190000 vs 0.003000 seconds
Sample 10 from 1e+08: 0.004000 vs 0.003000 seconds
Sample 10 from 1e+09: 0.004000 vs 0.002000 seconds

Sample 100 from 1e+05: 0.180000 vs 0.007000 seconds
Sample 100 from 1e+06: 0.908000 vs 0.006000 seconds
Sample 100 from 1e+07: 13.161000 vs 0.007000 seconds
Sample 100 from 1e+08: 0.007000 vs 0.006000 seconds
Sample 100 from 1e+09: 0.007000 vs 0.006000 seconds

Sample 1000 from 1e+05: 0.194000 vs 0.057000 seconds
Sample 1000 from 1e+06: 1.084000 vs 0.049000 seconds
Sample 1000 from 1e+07: 13.226000 vs 0.049000 seconds
Sample 1000 from 1e+08: 0.047000 vs 0.046000 seconds
Sample 1000 from 1e+09: 0.048000 vs 0.047000 seconds

Sample 10000 from 1e+05: 0.414000 vs 0.712000 seconds
Sample 10000 from 1e+06: 1.100000 vs 0.453000 seconds
Sample 10000 from 1e+07: 14.882000 vs 0.443000 seconds
Sample 10000 from 1e+08: 0.448000 vs 0.443000 seconds
Sample 10000 from 1e+09: 0.445000 vs 0.443000 seconds

Since my usage involves taking samples of 1K from populations of about
10M, the do_sample2() approach is the clear winner: .05 seconds vs 13
seconds.  I tested on a couple machines, and got similar results on
both. Would there be a downside to having sample.int() use the faster
do_sample2() approach for populations of all sizes whenever the ratio
of population size to sample size is large?

--nate

```