[R] Question about random sampling in R

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Thu Oct 19 20:34:16 CEST 2006


On 19-Oct-06 tom soyer wrote:
> Hi,
> 
> I looked up the help file on sample(), but didn't find the
> info I was looking for.
> 
> When sample() is used to resample from a distribution, e.g.,
> bootstrap, how does it do it? Does it use an uniform distribution,
> e.g., runif(), or something else?

I don't know the details of the algorithm, but since sample()
has flexible options it may be helpful to describe the effect
of sample() in different cases.

1. sample(x,r) where x is a vector of length n
In effect, the index values (1:n) of x are sampled from
without replacement (default) with a uniform probability
distribution over the available elements at all stages.
Hence, i1 is sampled from (1:n) with probability 1/n for
each possibility. Then i2 is sampled from the remainder
with probability 1/(n-1) for each, and so on until r items
(all distinct) have been sampled. If the resulting indices
are {i1,i2,...,ir} then the result is x[i1],x[i2],...,x[ir].
Thus, if some of the values in x[1],...,x[n] are equal,
you can get 2 or more items in the sample which are equal
even though the sampling is done without replacement (since
it is the indices which are sampled).
[NB I'm describing the *effect* here, not saying that this
is how the algorithm operates]

2. sample(x, replace=TRUE)
Similar to [1], except that the sampled index is returned
to the pool and is available to be sampled again, so at each
stage the probability of any value being chosen is 1/n.

3. sample(x, replace=TRUE, prob=p) where p is a vector of
   probability weights (which must not all be 0, and none
   negative).
First, p is converted into a probability distribution
(summing to 1) (in effect by dividing by the sum).
Then an index i1 is sampled from (1:n) with probability
p[i] that i is chosen. This is repeated (with previously
sampled i's still available) until r index values have been
sampled -- i1,...,ir. The result is x[i1],...,x[ir].

4. sample(x, prob=p) [without replacement]
First p is scaled to sum to 1, then i1 is sampled as in [3].
The remaining p-values are rescaled so as to sum to 1,
and i2 is sampled from the remaining i's; and so on.

These are the essential variants of the use of sample().

runif() can be used to sample i1 from (1:n) with equal
probabilities by selecting i if runif() is <= i and > (i-1)
for i = 1:n.

Similarly runif() can be used to sample i1 from (1:n)
with probabilities p1,...,pn by selecting i if

  p[1] + ... + p[i-1] < runif() <= p[1] + ... + p[i]

[LHS=0 if i=0], since the probability of this happening is p[i].

> And, when the help file
> says:"sample(x) generates a random permutation of the elements
> of x (or 1:x)",

Since the default value of r (size of sample) is the length
of x, say n, sample(x) (see [1] above) will sample n elements
without replacement from the n elements of x with uniform
probabilities at each stage. In effect, n elements i1,i2,...,in
will be sampled without replacement from (1:n), giving a
random permutation of (1:n), so the result x[i1],...,x[in]
will be a random permutation of x[1],...,x[n] (though
different random permutations may look identical if there
are equal values in x[1],...,x[n]).

> would I be correct if I translate the statement
> as follows: it means that the order of sequence, which was
> generated from a uniform distribution, would look like a
> random normal distribution.

No. A normal distribution has nothing to do with it!

*Unless* the values x[1],...,x[n] already loooked like values
which had already been sampled from a normal distribution (but
were, say, in increasing order of size). Then sample(x) would
shuffle them into random order so the result could then look
like a real sample according ot eh order in which the data
came in.

Hoping this helps!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 19-Oct-06                                       Time: 19:34:13
------------------------------ XFMail ------------------------------



More information about the R-help mailing list