[Rd] proposed change to 'sample'

Sun Jun 20 19:49:43 CEST 2010

> -----Original Message-----
> From: r-devel-bounces at r-project.org 
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Patrick Burns
> Sent: Sunday, June 20, 2010 3:08 AM
> To: r-devel at r-project.org
> Subject: [Rd] proposed change to 'sample'
> 
> There is a weakness in the 'sample'
> function that is highlighted in the
> help file.  The 'x' argument can be
> either the vector from which to sample,
> or the maximum value of the sequence
> from which to sample.
> 
> This can be ambiguous if the length of
> 'x' is one.
> 
> I propose adding an argument that allows
> the user (programmer) to avoid that
> ambiguity:
> 
> function (x, size, replace = FALSE, prob = NULL,
>      max = length(x) == 1L && is.numeric(x) && x >= 1)

S+'s sample() has an argument 'n' to achieve
the same result.  It has been there since at
least 2005 (S+ 7.0.6).  sample(n=n) means to
return a sample from seq_len(n), where n must
be a scalar nonnegative integer.  sample(x=x)
retains its old ambiguous meaning.
  sample(x, size = n, replace = F, prob = NULL, n = NULL, ...)
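
For anyone who wants to try the idea in R, here is a minimal sketch of
a wrapper with an explicit 'n' argument.  The name sample_n and the
argument checks are my assumptions for illustration; this is not the
S+ source, and it leans on R's sample.int() to do the drawing.

```r
# Hypothetical wrapper: `sample_n` is an illustrative name, not part of
# base R or S+.
sample_n <- function(x, size, replace = FALSE, prob = NULL, n = NULL) {
  if (!is.null(n)) {
    # Population is given unambiguously as 1:n; n must be a single
    # nonnegative whole number.
    stopifnot(is.numeric(n), length(n) == 1L, !is.na(n),
              n >= 0, n == floor(n))
    if (missing(size)) size <- n
    return(sample.int(n, size, replace, prob))
  }
  # No n supplied: fall back to the usual (ambiguous) sample(x) behaviour.
  if (missing(size)) sample(x, replace = replace, prob = prob)
  else sample(x, size, replace, prob)
}

sample_n(n = 10, size = 3)   # three values drawn from 1:10
sample_n(c(5, 7, 9))         # a permutation of the vector itself
```

A caller who always means "sample from 1:n" can pass n by name and
never hit the length-one surprise.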

S+ also has an rsample function where n (with
the same meaning) is the only way to specify the
population.  It also has an order=TRUE/FALSE argument
where order=TRUE means to randomly order the output.
order=FALSE means that the ordering of the output is
unspecified, but it allows the person writing rsample
methods to use the quickest way to get a random sample
(for big data it can be fastest to return the sample
from 1:n in increasing order). 
  rsample(n, size = n, replace = F, prob = NULL,
        bigdata = F, minimal = NULL, ..., order = T)
I like the idea of separating the concepts of sampling
and permuting data.  Many statistics are invariant to
ordering of the data and it can be a waste of time
to randomly order a sample to feed to such functions.
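
To make the sampling/permuting distinction concrete, here is a sketch
of drawing a sample without replacement that comes back in increasing
order and never shuffles anything.  It uses sequential selection
sampling (Knuth's Algorithm S), which I am swapping in as an
illustration; I am not claiming this is how rsample does it, and
ordered_sample is a made-up name.

```r
# Hypothetical helper illustrating the order = FALSE idea: returns `size`
# distinct indices from 1:n in increasing order, with no shuffle step.
ordered_sample <- function(n, size) {
  stopifnot(size >= 0, size <= n)
  out <- integer(size)
  chosen <- 0L
  for (i in seq_len(n)) {
    if (chosen == size) break
    # Keep i with probability (still needed) / (still available).
    if (runif(1) * (n - i + 1) < (size - chosen)) {
      chosen <- chosen + 1L
      out[chosen] <- i
    }
  }
  out
}

x <- rnorm(1000)
mean(x[ordered_sample(length(x), 50)])  # mean() does not care about order
```

A plain R loop like this is slow; it is only meant to show that a
uniform sample can be had without the permutation step.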

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> {
>      if (max) {
>          if (missing(size))
>              size <- x
>          .Internal(sample(x, size, replace, prob))
>      }
>      else {
>          if (missing(size))
>              size <- length(x)
>          x[.Internal(sample(length(x), size, replace, prob))]
>      }
> }
> <environment: namespace:base>
> 
> 
> This just takes the condition of the first
> 'if' to be the default value of the new 'max'
> argument.
> 
> So in the "surprise" section of the examples
> in the 'sample' help file
> 
> sample(x[x > 9])
> 
> and
> 
> sample(x[x > 9], max=FALSE)
> 
> have different behaviours.
> 
> By the way, I'm certainly not convinced that
> 'max' is the best name for the argument.
> 
> -- 
> Patrick Burns
> pburns at pburns.seanet.com
> http://www.burns-stat.com
> (home of 'Some hints for the R beginner'
> and 'The R Inferno')
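
The proposal quoted above can be tried outside of base with a
stand-alone sketch along these lines.  The name sample2 is
hypothetical, and sample.int() stands in for the .Internal() call, so
treat it as an approximation of the proposed function rather than the
patch itself.

```r
# Hypothetical stand-alone version of the proposal; `sample2` is an
# illustrative name, and sample.int() replaces the .Internal() call.
sample2 <- function(x, size, replace = FALSE, prob = NULL,
                    max = length(x) == 1L && is.numeric(x) && x >= 1) {
  if (max) {
    # x is read as the upper end of the sequence 1:x.
    if (missing(size)) size <- x
    sample.int(x, size, replace, prob)
  } else {
    # x is read as the vector to sample from, whatever its length.
    if (missing(size)) size <- length(x)
    x[sample.int(length(x), size, replace, prob)]
  }
}

x <- 1:10
sample2(x[x > 9])               # length-one input: a permutation of 1:10
sample2(x[x > 9], max = FALSE)  # always 10, the only element of the vector
```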