[R] applying data generating function

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon Mar 8 18:33:54 CET 2004


On Mon, 8 Mar 2004, Spencer Graves wrote:

>       With "gc()" right before each call to "proc.time", as Brian Ripley 
> and Gabor Grothendieck suggested, the times were substantially more 
> stable.  For the for loop, extending the vector with each of 1e5 
> iterations, I got 181.25, 181.27, 182.72, 182.44, and 182.56.  The 
> averages of the last 3 of these tests are as follows: 
> 
>                           10  100 1000 10000  1e+05
> for loop                   0 0.01 0.05  1.13 182.14
> gen e + for loop           0 0.00 0.03  0.26   2.58
> create storage + for loop  0 0.00 0.04  0.39   3.94
> sapply                     0 0.00 0.03  0.32   4.05
> replicate                  0 0.00 0.03  0.31   3.55
> 
> Without "gc()", I got 192.05, 182.02, 126.04, 130.30, and 118.64 for 
> extending the vector with each for loop iteration. 
> 
>       Three more observations about this: 
> 
>       1.  Without "gc()", the times started higher but declined by 
> roughly a third.  This suggests that R may actually be storing 
> intermediate "semi-compiled" code in "garbage" and using it when the 
> situation warrants -- but "gc()" discards it. 

I don't see anything in the code to allow for that possibility.

I believe it's down to the vagaries of garbage collection, and in
particular to how the tuning of the limits and the mix of level 0, 1 and 2
gc's gets adjusted during a run.  Here is a small experimental setup:

foo <- function(N)
{
  set.seed(123)
  gct <- gc.time()                 # cumulative GC time used so far
  res <- system.time({
    f <- function (x.) { 3.8*x.*(1-x.) + rnorm(1, 0, .001) }
    v <- c()                       # result vector, grown by one element per iteration
    x <- .1                        # starting point
    for (i in 1:N) { x <- f(x); v <- append(v, x) }
  })
  gct <- gc.time() - gct           # GC time used inside the timed block
  cbind(res, gct)
}

gc.time(TRUE)
gc()
> foo(1e4)
      res  gct
[1,] 1.39 1.12
[2,] 0.01 0.92
[3,] 1.41 1.07
[4,] 0.00 0.00
[5,] 0.00 0.00
> foo(1e5)
        res    gct
[1,] 218.68 242.86
[2,]  19.98 162.12
[3,] 238.83 246.10
[4,]   0.00   0.00
[5,]   0.00   0.00

so most (if not more than all) of the time is being spent on garbage
collection: something like 18000 gc's in the second run.
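
Most of that goes away if the storage is created up front rather than
grown with append().  A variant of the same setup (just an illustrative
sketch; 'foo2' is not code from Spencer's runs) might look like this:

foo2 <- function(N)
{
  set.seed(123)
  gct <- gc.time()
  res <- system.time({
    f <- function (x.) { 3.8*x.*(1-x.) + rnorm(1, 0, .001) }
    v <- numeric(N)                # result vector allocated once, up front
    x <- .1                        # starting point
    for (i in 1:N) { x <- f(x); v[i] <- x }
  })
  gct <- gc.time() - gct
  cbind(res, gct)
}

Timed the same way (gc.time(TRUE); gc(); foo2(1e5)), I would expect this to
run in seconds rather than minutes, in line with the 'create storage + for
loop' row of the table quoted above.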

>       2.  Increasing N from 1e4 to 1e5 increased the time NOT by a 
> factor of 10 but by a factor of 161 = 182/1.13 when the length of the 
> vector was extended in each iteration. 

Right, but 9/10 of those additional allocations/garbage collections involve
longer vectors than before, and so each one takes more time.  In
particular, objects above a small size are allocated and freed directly
(via malloc), so this will also depend on the speed of your malloc.  How
the time to allocate n bytes depends on n will be very system-specific.
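
A back-of-the-envelope way to see the scaling: if append() has to
reallocate and copy the whole vector on every iteration (a worst-case
assumption, ignoring any slack the allocator may leave for growth), the
total number of elements copied over N iterations is about
1 + 2 + ... + N = N*(N+1)/2, so the copying/allocation work grows roughly
100-fold, not 10-fold, when N goes from 1e4 to 1e5:

copy.work <- function(N) N * (N + 1) / 2   # total elements copied, worst case
copy.work(1e5) / copy.work(1e4)            # about 100, not 10

On top of that come the extra garbage collections over ever-larger
objects, which would push the ratio further toward the observed 161.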

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
