[R] How to benchmark speed of load/readRDS correctly

William Dunlap wdunlap at tibco.com
Tue Aug 22 19:12:54 CEST 2017


Note that if you force a garbage collection on each iteration, the times are
more stable.  On average, however, it is faster to let the garbage
collector decide when to leap into action.

mb_gc <- microbenchmark::microbenchmark(
  gc(),                              # force a collection before each run
  { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
  times = 1000,
  control = list(order = "inorder")  # alternate gc() with the expression
)
with(mb_gc, plot(time[expr != "gc()"]))
with(mb_gc, quantile(1e-6 * time[expr != "gc()"],  # nanoseconds -> milliseconds
                     c(0, .5, .75, .9, .95, .99, 1)))
#       0%       50%       75%       90%       95%       99%      100%
# 59.33450  61.33954  63.43457  66.23331  68.93746  74.45629 158.09799
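
As a cross-check (a minimal sketch using only base R, with the same toy
expression), timing the expression by hand with an explicit gc() before each
run shows the same stabilization, since no collection pause can land inside
the measured code:

cross_check <- vapply(seq_len(100), function(i) {
  gc()  # full collection before timing; its cost is excluded from the measurement
  system.time({
    x <- as.list(sin(1:5e5))
    x <- unlist(x) / cos(1:5e5)
    sum(x)
  })[["elapsed"]]
}, numeric(1))
quantile(1e3 * cross_check, c(0, .5, .9, .99, 1))  # seconds -> milliseconds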



Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Aug 22, 2017 at 9:26 AM, William Dunlap <wdunlap at tibco.com> wrote:

> The large value for the maximum time may be due to garbage collection, which
> happens periodically.  E.g., try the following, where the unlist(as.list())
> creates a lot of garbage.  I get a very large time every 102 or 51 iterations
> and a moderately large time more often.
>
> mb <- microbenchmark::microbenchmark(
>   { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
>   times = 1000
> )
> plot(mb$time)
> quantile(mb$time * 1e-6, c(0, .5, .75, .90, .95, .99, 1))
> #       0%       50%       75%       90%       95%       99%      100%
> # 59.04446  82.15453 102.17522 180.36986 187.52667 233.42062 249.33970
> diff(which(mb$time > quantile(mb$time, .99)))
> # [1] 102  51 102 102 102 102 102 102  51
> diff(which(mb$time > quantile(mb$time, .95)))
> # [1]  6 41  4 47  4 40  7  4 47  4 33 14  4 47  4 47  4 47  4 47  4 47  4  6 41
> #[26]  4  6  7  9 25  4 47  4 47  4 47  4 22 25  4 33 14  4  6 41  4 47  4 22
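>
> To confirm that the collector is responsible (a sketch; gc.time() and
> gcinfo() are base R), gc.time() reports cumulative time spent in garbage
> collection, so the difference across the run is the GC share:
>
> before <- gc.time()                 # cumulative GC time so far
> mb <- microbenchmark::microbenchmark(
>   { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
>   times = 1000
> )
> gc.time() - before                  # user/system/elapsed time spent in GC
> # gcinfo(TRUE) would instead print a message at every collection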
>
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Tue, Aug 22, 2017 at 5:53 AM, <raphael.felber at agroscope.admin.ch>
> wrote:
>
>> Dear all
>>
>> I was thinking about reading data into R efficiently and tried several ways
>> to test whether load('file.Rdata') or readRDS('file.rds') is faster. The
>> files file.Rdata and file.rds contain the same data, the first created with
>> save(d, file = 'file.Rdata', compress = FALSE) and the second with
>> saveRDS(d, 'file.rds', compress = FALSE).
>>
>> First I used the function microbenchmark() and was astonished by the max
>> value in the output.
>>
>> FIRST TEST:
>> > library(microbenchmark)
>> > microbenchmark(
>> +   n <- readRDS('file.rds'),
>> +   load('file.Rdata')
>> + )
>> Unit: milliseconds
>>               expr      min       lq     mean   median       uq       max neval
>>  n <- readRDS(fl1) 106.5956 109.6457 237.3844 117.8956 141.9921 10934.162   100
>>          load(fl2) 295.0654 301.8162 335.6266 308.3757 319.6965  1915.706   100
>>
>> It looks like the max value is an outlier.
>>
>> So I tried:
>> SECOND TEST:
>> > sapply(1:10, function(x) system.time(n <- readRDS('file.rds'))[3])
>> elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
>>   10.50    0.11    0.11    0.11    0.10    0.11    0.11    0.11    0.12    0.12
>> > sapply(1:10, function(x) system.time(load('file.Rdata'))[3])
>> elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
>>    1.86    0.29    0.31    0.30    0.30    0.31    0.30    0.29    0.31    0.30
>>
>> This confirmed my suspicion: the first read takes much longer than the
>> following ones. I suspect this has something to do with how the data is
>> assigned, and that R doesn't have to 'fully' read the data when it is read
>> a second time.
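>>
>> If the slow first read really is a one-off warm-up effect, a minimal
>> sketch is to discard the first timing before summarizing:
>>
>> elapsed <- sapply(1:10, function(x) system.time(n <- readRDS('file.rds'))[3])
>> summary(elapsed[-1])   # drop the first, cold read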
>>
>> So the question remains: how can I make a realistic benchmark? From the
>> first test I would conclude that reading the *.rds file is faster, but this
>> holds only for a large number of evaluations (neval). If I set times = 1,
>> then reading the *.Rdata file would be faster (as the second test also
>> indicates).
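>>
>> One possibility (a sketch, assuming the outlier is a one-off caching
>> effect): read each file once untimed so the cold read is excluded, then
>> compare medians rather than means or maxima:
>>
>> n <- readRDS('file.rds'); load('file.Rdata')   # untimed warm-up reads
>> mb <- microbenchmark::microbenchmark(
>>   rds   = n <- readRDS('file.rds'),
>>   rdata = load('file.Rdata'),
>>   times = 100
>> )
>> summary(mb)   # the median column is robust to a single slow run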
>>
>> Thanks for any help or comments.
>>
>> Kind regards
>>
>> Raphael
>> ------------------------------------------------------------------------
>> Raphael Felber, PhD
>> Scientific Officer, Climate & Air Pollution
>>
>> Federal Department of Economic Affairs,
>> Education and Research EAER
>> Agroscope
>> Research Division, Agroecology and Environment
>>
>> Reckenholzstrasse 191, CH-8046 Zürich
>> Phone +41 58 468 75 11
>> Fax   +41 58 468 72 01
>> raphael.felber at agroscope.admin.ch
>> www.agroscope.ch
>>
>>
>
>
