[R] Random # generator accuracy

Fri Jul 24 01:34:50 CEST 2009

Perfectly explained Ted.  One might, at first reflection, consider that
simply repeating the values 7 through 12 and  sampling (w/o replacement)
from among the 18 resulting values, would be similar to just doubling the
selection probabilities for 7 through 12 and then sampling. That would
clearly not be true though.
Jim
> 
> Whereas, if you replace x = c(1,2,3,4,5,6,7,8,9,110,11,12)
> with the "weighted" equivalent, doubling up 7-12 as in your
>   x2 = c(1,2,3,4,5,6,7,7,8,8,9,9,10,10,11,11,12,12),
> each of the 18 items now has the same weight as the others,
> and the "unweighted" sampling
>   mean(replicate(1000000,(sample(x2, 3))))
> now gives the mean of the 18 values (7.5); whereas -- as my
> calculation showed -- the effect of the "sequential" weighting is
> to bias the result slightly downwards (in your example; generally,
> in favour of the items with lower weights), since the way weighting
> works in sample() is not equivalent to replicating each item "weight"
> times.
> 
> The general problem, of sampling without replacement in such a way
> that for each item the probability that it is included in the sample
> is proportional to a pre-assigned weight ("sampling with probability
> proportional to size") is quite tricky and, for certain choices
> of weights, impossible. For a glimpse of what's inside the can of
> worms, have a look at the reference manual for the 'sampfling'
> package, in particular the function samprop():
> 
> http://www.stats.bris.ac.uk/R/web/packages/sampfling/sampfling.pdf
> 
> Ted.
> 
> On 23-Jul-09 20:56:43, Jim Bouldin wrote:
> > 
> > You are absolutely correct Ted.  When no weights are applied it doesn't
> > matter if you sample with or without replacement, because the
> > probability
> > of choosing any particular value is equally distributed among all such.
> > But when they're weighted unequally that's not the case.
> > 
> > It is also interesting to note that if the problem is set up slightly
> > differently, by say defining the vector x as:
> > x = c(1,2,3,4,5,6,7,7,8,8,9,9,10,10,11,11,12,12), effectively giving
> > the
> > same probability of selection for the 12 integers as before, the same
> > problem does not arise, or at least not as severely:
> > 
> >> x2
> >  [1]  1  2  3  4  5  6  7  8  9 10 11 12  7  8  9 10 11 12
> > 
> >> d = mean(replicate(1000000,(sample(x2, 3))));d  # (1 million samples
> >> from
> > x2, of size 3; the mean should be 7.50)
> > [1] 7.499233
> > 
> >> e = mean(replicate(1000000,(sample(x2, 3, replace = TRUE))));e  # (1
> > million samples from x2, of size 3; with replacement this time the mean
> > should still be 7.50)
> > [1] 7.502085
> > 
> >> d/e
> > [1] 0.9996198
> > 
> > Jim
> >> 
> >> To obtain the result you expected, you would need to explicitly
> >> specify "replace=TRUE", since the default for "replace" is FALSE.
> >> (though probably what you really intended was sampling without
> >> replacement).
> >> 
> >>  -- when
> >> replace=FALSE, the probability of inclusion of an element is not
> >> proportional to its weight in 'prob'.
> >> 
> >> The reason is that elements with higher weights are more likely
> >> to be chosen early on. This then knocks that higher weight out
> >> of the contest, making it more likely that elements with smaller
> >> weights will be sampled subsequently. Hence the mean of the sample
> >> will be biased slightly downwards, relative to the weighted mean
> >> of the values in x.
> >> 
> >>   table(replicate(1000000,(sample(x, 3))))
> >>   #      1      2      3      4      5      6
> >>   # 250235 250743 249603 250561 249828 249777
> >>   #      7      8      9     10     11     12
> >>   # 249780 250478 249591 249182 249625 250597
> >> 
> >> (so all nice equal frequencies)
> >> 
> >>   table(replicate(1000000,(sample(x, 3,prob=weights))))
> >>   #      1      2      3      4      5      6
> >>   # 174873 175398 174196 174445 173240 174110
> >>   #      7      8      9     10     11     12
> >>   # 325820 326140 325289 325098 325475 325916
> >> 
> >> Note that the frequencies of the values with weight=2 are a bit
> >> less than twice the frequencies of the values with weight=1:
> >> 
> >>   (325820+326140+325289+325098+325475+325916)/
> >>     (174873+175398+174196+174445+173240+174110)
> >>   # [1] 
> >> 
> >> 
> >> In fact this is fairly easily caluclated. The possible combinations
> >> (in order of sampling) of the two weights, with their probabilities,
> >> are:
> >>                                      1s  2s
> >> -------------------------------------------
> >> 1 1 1   P =  6/18 *  5/17 *  4/16    3   0
> >> 1 1 2   P =  6/18 *  5/17 * 12/16    2   1
> >> 1 2 1   P =  6/18 * 12/17 *  5/15    2   1
> >> 1 2 2   P =  6/18 * 12/17 * 10/15    1   2
> >> 2 1 1   P = 12/18 *  6/16 *  5/15    2   1
> >> 2 1 2   P = 12/18 *  6/16 * 10/15    1   2
> >> 2 2 1   P = 12/18 * 10/16 *  6/14    1   2
> >> 2 2 2   P = 12/18 * 10/16 *  8/14    0   3
> >> 
> >> So the expected number of "weight=1" in the sample is
> >> 
> >>   3*(6/18 *  5/17 *  4/16)  + 2*(6/18 *  5/17 * 12/16) +
> >>   2*(6/18 * 12/17 *  5/15)  + 1*(6/18 * 12/17 * 10/15) +
> >>   2*(12/18 *  6/16 *  5/15) + 1*(12/18 *  6/16 * 10/15) +
> >>   1*(12/18 * 10/16 *  6/14) + 0
> >>   = 1.046218
> >> 
> >> Hence the expected number of "weight=2" in the sample is
> >> 
> >>   3 - 1.046218 = 1.953782
> >> 
> >> and their ratio 1.953782/1.046218 = 1.867471
> >> 
> >> Compare this with the value 1.867351 (above) obtained by simulation!
> >> 
> >> Ted.
> >> 
> >> --------------------------------------------------------------------
> >> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> >> Fax-to-email: +44 (0)870 094 0861
> >> Date: 23-Jul-09                                       Time: 21:05:07
> >> ------------------------------ XFMail ------------------------------
> >> 
> > 
> > Jim Bouldin, PhD
> > Research Ecologist
> > Department of Plant Sciences, UC Davis
> > Davis CA, 95616
> > 530-554-1740
> > 
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> 
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 23-Jul-09                                       Time: 23:00:56
> ------------------------------ XFMail ------------------------------
> 

Jim Bouldin, PhD
Research Ecologist
Department of Plant Sciences, UC Davis
Davis CA, 95616
530-554-1740