[Rd] Change in the RNG implementation?

Hervé Pagès hpages at fhcrc.org
Mon Oct 22 08:02:50 CEST 2012


Hi Duncan, Martin,

Thanks for your answers.

For my real case I was generating millions of random positions
on a genome.

I compared sample.int() performance (called via sample()) between
R-2.15.1 and R-devel, and, for me, it performs better in R-2.15.1:
about 2.8x faster (5.2 s vs 14.9 s elapsed) with a lower peak memory
footprint (1122.6 Mb vs 1388.8 Mb of Vcells "max used"):

With R-2.15.1:

   > set.seed(33)

   > system.time(random_chrom_pos <- sample(199000666L, 95000777L))
      user  system elapsed
     4.964   0.268   5.242

   > gc()
              used  (Mb) gc trigger   (Mb)  max used   (Mb)
   Ncells   137285   7.4     350000   18.7    350000   18.7
   Vcells 47633785 363.5  154735917 1180.6 147135703 1122.6

   > sessionInfo()
   R version 2.15.1 (2012-06-22)
   Platform: x86_64-unknown-linux-gnu (64-bit)

   locale:
    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
    [7] LC_PAPER=C                 LC_NAME=C
    [9] LC_ADDRESS=C               LC_TELEPHONE=C
   [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base

With R-devel:

   > set.seed(33)

   > system.time(random_chrom_pos <- sample(199000666L, 95000777L))
      user  system elapsed
    14.532   0.296  14.854

   > gc()
              used  (Mb) gc trigger   (Mb)  max used   (Mb)
   Ncells   145525   7.8     350000   18.7    350000   18.7
   Vcells 47644082 363.5  152959996 1167.0 182023372 1388.8

   > sessionInfo()
   R Under development (unstable) (2012-10-02 r60861)
   Platform: x86_64-unknown-linux-gnu (64-bit)

   locale:
    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
    [7] LC_PAPER=C                 LC_NAME=C
    [9] LC_ADDRESS=C               LC_TELEPHONE=C
   [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base

FWIW my R-2.15.1 and R-devel were both configured with
--disable-byte-compiled-packages; otherwise I use all the
defaults. Also, my system is a standard Ubuntu 12.04 installation
with no fancy settings, tweaks, or customizations.
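
In case it helps anyone repeat the comparison, the measurement above
can be wrapped in a small script along these lines. This is only a
sketch: time_sample() is a name made up for illustration (it is not
part of R), and the sizes in the usage line are scaled down from my
real case so that it runs quickly while still satisfying n > 1e7 and
size <= n/2.

```r
# Hypothetical helper (not part of R): time one sample() call and
# report the peak vector-heap usage that gc() records for it.
time_sample <- function(n, size, seed = 33) {
  invisible(gc(reset = TRUE))      # reset the "max used" statistics
  set.seed(seed)
  timing <- system.time(x <- sample(n, size))
  mem <- gc()                      # last column is "max used" in Mb
  list(
    r_version   = R.version.string,
    elapsed_sec = timing[["elapsed"]],
    max_used_mb = unname(mem["Vcells", ncol(mem)]),
    n_drawn     = length(x),
    n_unique    = length(unique(x))  # equals n_drawn without replacement
  )
}

# Scaled-down usage; source the same script in both R versions and
# compare the two printed summaries.
res <- time_sample(20000000L, 1000000L)
str(res)
```

Calling gc(reset = TRUE) first means the "max used" figure reflects
only the sample() call rather than the whole session, which the plain
gc() calls in my transcripts do not guarantee.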

Thanks,
H.


On 10/20/2012 12:50 PM, Martin Maechler wrote:
>>>>>> Duncan Murdoch <murdoch.duncan at gmail.com>
>>>>>>      on Fri, 19 Oct 2012 19:26:39 -0400 writes:
>
>      > On 12-10-19 7:04 PM, Hervé Pagès wrote:
>      >> Hi,
>      >>
>      >> Looks like the implementation of random number generation changed in
>      >> R-devel with respect to R-2.15.1.
>      >>
>      >> With R-2.15.1:
>      >>
>      >> > set.seed(33)
>      >> > sample(49821115, 10)
>      >> [1] 22217252 19661919 24099911 45779422 42043111 25774933 21778053 17098516
>      >> [9]   773073  5878451
>      >>
>      >> With recent R-devel:
>      >>
>      >> > set.seed(33)
>      >> > sample(49821115, 10)
>      >> [1] 22217252 19661919 24099912 45779425 42043115 25774935 21778056 17098518
>      >> [9]   773073  5878452
>      >>
>      >> This is on a 64-bit Ubuntu system.
>      >>
>      >> Is this change intended? I didn't see anything in the NEWS file.
>      >>
>      >> A potential problem with this is that it will break unit tests
>      >> for algorithms that make use of RNG.
>      >>
>      >> Another more practical problem (at least for me) is the following:
>      >> Bioconductor package maintainers are sometimes working hard on the
>      >> development version of their package to improve the performance of
>      >> some key functions. Comparing performance between BioC release
>      >> (based on R-2.15) and devel (based on R-devel) often requires big
>      >> input data that is randomly generated, because it's easier than
>      >> working with real data. Typically a small script is written that
>      >> takes care of loading the required packages, generating the input
>      >> data, and running a simple analysis. The same script is sourced in
>      >> R-2.15 and R-devel, and performance and results are compared.
>      >>
>      >> Not being able to generate exactly the same input in the script is
>      >> a problem. It can be worked around by generating the input once,
>      >> serializing it, and using load() in the script, but that makes things
>      >> more complicated and the script is not a standalone script anymore
>      >> (cannot be passed around without also passing around the big .rda
>      >> file).
>      >>
>      >> Thanks,
>      >> H.
>      >>
>
>      > I think it was mentioned in the NEWS:
>
>      > \code{sample.int()} has some support for  \eqn{n \ge
>      > 2^{31}}{n >= 2^31}: see its help for the limitations.
>
>      > A different algorithm is used for \code{(n, size, replace = FALSE,
>      > prob = NULL)} for \code{n > 1e7} and \code{size <= n/2}.  This
>      > is much faster and uses less memory, but does give different results.
>
> So, to reiterate: the RNG has not been changed at all,
> but sample() has, for extreme cases (large n) like yours.
>
>      > I don't think the old algorithm is available, but perhaps it could be
>      > made available by an optional parameter.
>
> I do think we should ideally add such an option, or probably
> rather allow the more thorough approach of using
> RNGversion(..) or something similar to set sample()'s behavior
> to exactly what it was previously.
> Doing this "globally" is really needed, as sample() may be called from a
> function (from a function from a function) that is not in the
> programmer's hands, so the programmeR could not even
> set the new optional argument if he found out that he had to.
>
> Honestly, I'm surprised Hervé found a real case where the
> difference is visible.
>
> Martin
>
>
>      > Duncan Murdoch
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


