[Rd] efficiency of sample() with prob.

Prof Brian Ripley ripley at stats.ox.ac.uk
Fri Jun 24 08:10:46 CEST 2005


`Research' involves looking at all the competitor methods, devising a 
near-optimal strategy and selecting amongst methods according to that 
strategy.  It is not a quick fix we are looking for but something that
will be good for the long term.

On Thu, 23 Jun 2005, Bo Peng wrote:

>> We suggest that you take up your own suggestion, research this area and
>> offer the R project the results of your research for consideration as your
>> contribution.
>
> I implemented Walker's alias method and re-compiled R. Here is what
> I did:
>
> 1. replace function ProcSampleReplace in R-2.1.0/src/main/random.c
>  with the following one
>
> static void ProbSampleReplace(int n, double *p, int *perm, int nans, int *ans)
> {
>    /* allocate memory for a, p and HL */
>    double * q = Calloc(n, double);
>    int * a = Calloc(n, int);
>    int * HL = Calloc(n, int);
>    int * H = HL;
>    int * L = HL+n-1;
>    int i, j, k;
>    double rU;  /* U[0,1)*n */
>
>    /* set up alias table */
>    /* initialize q with n*p0,...n*p_n-1 */
>    for(i=0; i<n; ++i)
>        q[i] = p[i]*n;
>
>    /* initialize a with indices */
>    for(i=0; i<n; ++i)
>        a[i] = i;
>
>    /* set up H and L */
>    for(i=0; i<n; ++i) {
>        if( q[i] >= 1.)
>            *H++ = i;
>        else
>            *L-- = i;
>    }
>
>    while( H != HL && L != HL+n-1) {
>        j = *(L+1);
>        k = *(H-1);
>        a[j] = k;
>        q[k] += q[j] - 1;
>        L++;                                  /* remove j from L */
>        if( q[k] < 1. ) {
>            *L-- = k;                         /* add k to L */
>            --H;                              /* remove k */
>        }
>    }
>
>    /* generate sample */
>    for (i = 0; i < nans; ++i) {
> 	rU = unif_rand() * n;
>
>        k = (int)(rU);
>        rU -= k;  /* rU becomes rU-[rU] */
>
>        if( rU < q[k] )
>            ans[i] = k+1;
>        else
>            ans[i] = a[k]+1;
>    }
>    Free(HL);
>    Free(a);
>    Free(q);
> }
>
> 2. make and make install
>
> 3. test the new sample function by code like
>
>> b=sample(seq(1,100), prob=seq(1,100), replace=TRUE, size=1000000)
>> table(b)/1000000*sum(seq(1,100))
>
> 4. run the following code in current R 2.1.0 and updated R.
>
> for(prob in seq(1,4)){
>  for(sample in seq(1,4)){
>    x = seq(1:(10^prob))   # short to long x
>    p = abs(rnorm(length(x)))  # prob vector
>    times = 10^(6-prob)   # run shorter cases more times
>    Rprof(paste("sample_", prob, "_", sample, ".prof", sep=''))
>    for(t in seq(1,times)){
>      sample(x, prob=p, size=10^sample, replace=TRUE )
>    }
>    Rprof(NULL)
>  }
> }
>
> Basically, I tried to test the performance of sample(replace=TRUE, prob=..)
> with different length of x and size.
>
> 5. process the profiles and here is the result.
> p: length of prob and x
> size: size of sample
> cell: execution time of old/updated sample()
>
>  size\p    10          10^2        10^3       10^4
>  10       2.4/1.6      0.32/0.22   0.20/0.08  0.24/0.06
>  10^2     3.1/2.6      0.48/0.28   0.28/0.06  0.30/0.06
>  10^3     11.8/11.1    1.84/1.14   0.94/0.18  0.96/0.08
>  10^4     96.8/96.6    15.34/9.68  7.54/1.06  7.48/0.16
>  run:     10000        1000        100        10 times
>
> We can see that the alias method is quicker than the linear search
> method in all cases. The performance difference is greatest (>50 times)
> when the original algorithm need to search in a long prob vector.
>
> I have not thoroughly tested the new function. I will do so if you
> (the developers) think that this has the potential to be incorporated
> into R.
>
> Thanks.
>
> Bo Peng
> Department of Statistics
> Rice University
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



More information about the R-devel mailing list