[R] Random # generator accuracy

Jim Bouldin jrbouldin at ucdavis.edu
Thu Jul 23 22:56:43 CEST 2009


You are absolutely correct Ted.  When no weights are applied it doesn't
matter if you sample with or without replacement, because the probability
of choosing any particular value is equally distributed among all such. 
But when they're weighted unequally that's not the case.

It is also interesting to note that if the problem is set up slightly
differently, by say defining the vector x as:
x = c(1,2,3,4,5,6,7,7,8,8,9,9,10,10,11,11,12,12), effectively giving the
same probability of selection for the 12 integers as before, the same
problem does not arise, or at least not as severely:

> x2
 [1]  1  2  3  4  5  6  7  8  9 10 11 12  7  8  9 10 11 12

> d = mean(replicate(1000000,(sample(x2, 3))));d  # (1 million samples from
x2, of size 3; the mean should be 7.50)
[1] 7.499233

> e = mean(replicate(1000000,(sample(x2, 3, replace = TRUE))));e  # (1
million samples from x2, of size 3; with replacement this time the mean
should still be 7.50)
[1] 7.502085

> d/e
[1] 0.9996198

Jim
> 
> To obtain the result you expected, you would need to explicitly
> specify "replace=TRUE", since the default for "replace" is FALSE.
> (though probably what you really intended was sampling without
> replacement).
> 
> Read carefully what is said about "prob" in '?sample' -- when
> replace=FALSE, the probability of inclusion of an element is not
> proportional to its weight in 'prob'.
> 
> The reason is that elements with higher weights are more likely
> to be chosen early on. This then knocks that higher weight out
> of the contest, making it more likely that elements with smaller
> weights will be sampled subsequently. Hence the mean of the sample
> will be biased slightly downwards, relative to the weighted mean
> of the values in x.
> 
>   table(replicate(1000000,(sample(x, 3))))
>   #      1      2      3      4      5      6
>   # 250235 250743 249603 250561 249828 249777
>   #      7      8      9     10     11     12
>   # 249780 250478 249591 249182 249625 250597
> 
> (so all nice equal frequencies)
> 
>   table(replicate(1000000,(sample(x, 3,prob=weights))))
>   #      1      2      3      4      5      6
>   # 174873 175398 174196 174445 173240 174110
>   #      7      8      9     10     11     12
>   # 325820 326140 325289 325098 325475 325916
> 
> Note that the frequencies of the values with weight=2 are a bit
> less than twice the frequencies of the values with weight=1:
> 
>   (325820+326140+325289+325098+325475+325916)/
>     (174873+175398+174196+174445+173240+174110)
>   # [1] 
> 
> 
> In fact this is fairly easily caluclated. The possible combinations
> (in order of sampling) of the two weights, with their probabilities,
> are:
>                                      1s  2s
> -------------------------------------------
> 1 1 1   P =  6/18 *  5/17 *  4/16    3   0
> 1 1 2   P =  6/18 *  5/17 * 12/16    2   1
> 1 2 1   P =  6/18 * 12/17 *  5/15    2   1
> 1 2 2   P =  6/18 * 12/17 * 10/15    1   2
> 2 1 1   P = 12/18 *  6/16 *  5/15    2   1
> 2 1 2   P = 12/18 *  6/16 * 10/15    1   2
> 2 2 1   P = 12/18 * 10/16 *  6/14    1   2
> 2 2 2   P = 12/18 * 10/16 *  8/14    0   3
> 
> So the expected number of "weight=1" in the sample is
> 
>   3*(6/18 *  5/17 *  4/16)  + 2*(6/18 *  5/17 * 12/16) +
>   2*(6/18 * 12/17 *  5/15)  + 1*(6/18 * 12/17 * 10/15) +
>   2*(12/18 *  6/16 *  5/15) + 1*(12/18 *  6/16 * 10/15) +
>   1*(12/18 * 10/16 *  6/14) + 0
>   = 1.046218
> 
> Hence the expected number of "weight=2" in the sample is
> 
>   3 - 1.046218 = 1.953782
> 
> and their ratio 1.953782/1.046218 = 1.867471
> 
> Compare this with the value 1.867351 (above) obtained by simulation!
> 
> Ted.
> 
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 23-Jul-09                                       Time: 21:05:07
> ------------------------------ XFMail ------------------------------
> 

Jim Bouldin, PhD
Research Ecologist
Department of Plant Sciences, UC Davis
Davis CA, 95616
530-554-1740




More information about the R-help mailing list