[Rd] Change in the RNG implementation?

Martin Maechler maechler at stat.math.ethz.ch
Sat Oct 20 21:50:41 CEST 2012


>>>>> Duncan Murdoch <murdoch.duncan at gmail.com>
>>>>>     on Fri, 19 Oct 2012 19:26:39 -0400 writes:

    > On 12-10-19 7:04 PM, Hervé Pagès wrote:
    >> Hi,
    >> 
    >> Looks like the implementation of random number generation changed in
    >> R-devel with respect to R-2.15.1.
    >> 
    >> With R-2.15.1:
    >> 
    >> > set.seed(33)
    >> > sample(49821115, 10)
    >> [1] 22217252 19661919 24099911 45779422 42043111 25774933 21778053
    >> 17098516
    >> [9]   773073  5878451
    >> 
    >> With recent R-devel:
    >> 
    >> > set.seed(33)
    >> > sample(49821115, 10)
    >> [1] 22217252 19661919 24099912 45779425 42043115 25774935 21778056
    >> 17098518
    >> [9]   773073  5878452
    >> 
    >> This is on a 64-bit Ubuntu system.
    >> 
    >> Is this change intended? I didn't see anything in the NEWS file.
    >> 
    >> A potential problem with this is that it will break unit tests
    >> for algorithms that make use of RNG.
    >> 
    >> Another more practical problem (at least for me) is the following:
    >> Bioconductor package maintainers are sometimes working hard on the
    >> development version of their package to improve the performance of
    >> some key functions. Comparing performance between BioC release
    >> (based on R-2.15) and devel (based on R-devel) often requires big
    >> input data that is randomly generated, because it's easier than
    >> working with real data. Typically a small script is written that
    >> takes care of loading the required packages, generating the input
    >> data, and running a simple analysis. The same script is sourced in
    >> R-2.15 and R-devel, and performance and results are compared.
    >> 
    >> Not being able to generate exactly the same input in the script is
    >> a problem. It can be worked around by generating the input once,
    >> serializing it, and using load() in the script, but that makes things
    >> more complicated and the script is not a standalone script anymore
    >> (cannot be passed around without also passing around the big .rda
    >> file).
    >> 
    >> Thanks,
    >> H.
    >> 

    > I think it was mentioned in the NEWS:

    > \code{sample.int()} has some support for  \eqn{n \ge
    > 2^{31}}{n >= 2^31}: see its help for the limitations.

    > A different algorithm is used for \code{(n, size, replace = FALSE,
    > prob = NULL)} for \code{n > 1e7} and \code{size <= n/2}.  This
    > is much faster and uses less memory, but does give different results.

So, to reiterate: the RNG has not been changed at all,
but sample() has, for extreme cases (large n) like yours.
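
For a standalone script, one possible workaround is to build the input
directly from the uniform stream instead of going through sample().
This is only a sketch, and it assumes that runif() returns the same
stream in both versions, which is what "the RNG has not been changed"
implies. The helper name sample_via_runif is made up for illustration
and is not part of R. Unlike sample(), it draws with replacement, so
duplicates are possible.

```r
# Hypothetical helper (not part of R): draw `size` integers from 1..n
# using only the uniform stream, so the result does not depend on which
# algorithm sample() happens to use internally.
sample_via_runif <- function(n, size) {
  # runif() returns values strictly inside (0, 1), so floor(u * n) + 1
  # is always an integer-valued number in 1..n.
  # Note: this samples WITH replacement; duplicates are possible.
  floor(runif(size) * n) + 1
}

set.seed(33)
x <- sample_via_runif(49821115, 10)
```

If duplicates matter for the benchmark input, they would have to be
handled separately; the sketch does not attempt that.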

    > I don't think the old algorithm is available, but perhaps it could be 
    > made available by an optional parameter.

I do think we should ideally add such an option, or probably
rather take the more thorough route of using
RNGversion(..) or something similar to set sample()'s behavior
to exactly what it was previously.
Doing this "globally" is really needed, as sample() may be called from a
function (from a function from a function) that is not in the
programmer's hands, so the programmer could not even
set the new optional argument if he found out that he had to.
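
To illustrate why a global switch reaches further than an argument,
here is a toy sketch. Everything in it is hypothetical: my_sample
merely stands in for sample(), the option name "demo.sample.legacy" is
invented, and the two code paths are placeholders rather than the real
algorithms.

```r
# Hypothetical stand-in for sample(): picks a code path from a global option.
my_sample <- function(n, size) {
  if (isTRUE(getOption("demo.sample.legacy", FALSE))) {
    "legacy algorithm"   # placeholder for the old code path
  } else {
    "new algorithm"      # placeholder for the new code path
  }
}

# Code the programmer cannot edit: there is no argument to pass through.
inner <- function() my_sample(1e8, 10)
outer <- function() inner()

r1 <- outer()                               # "new algorithm"
old <- options(demo.sample.legacy = TRUE)   # flip the switch globally
r2 <- outer()                               # "legacy algorithm"
options(old)                                # restore the previous setting
```

The caller never touches inner() or outer(), yet the behavior deep in
the call chain changes, which an optional argument on my_sample() alone
could not achieve.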

Honestly, I'm surprised Hervé found a real case where the
difference is visible.

Martin


    > Duncan Murdoch


