[Rd] RE: [R] when can we expect Prof Tierney's compiled R?

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Apr 27 10:12:50 CEST 2005


On Tue, 26 Apr 2005, Vadim Ogranovich wrote:

> Thank you for sharing the benchmark results. The improvement is very
> substantial, I am looking forward to the release of the byte compiler!
>
> The arithmetic shows that x[i]<- is still the bottleneck. I suspect that
> this is due to a very involved dispatching/search for the appropriate
> function on the C level. There might be significant gain if loops
> somehow cached the result of the initial dispatching. This is what you
> probably referred to as additional improvements in the engine itself.

I'd be surprised if dispatching were the issue: have you (C-level) 
profiled to find out?  Please do so: these statements do tend to get 
perpetuated as fact.

You cannot cache the result of the dispatch: [<- can change the class of 
x, as could other operations performed by the rhs (e.g. if it were 
x[i] <- g(x, i), the function g could change its argument).
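To illustrate the point, here is a made-up minimal sketch (the class names and methods are invented, not from the thread): an S3 `[<-` method can itself change the class of x, so the method resolved on one iteration of a loop may be the wrong one on the next, which is why the dispatch cannot safely be cached.

```r
# Invented example: subassignment flips the class of x, so `[<-`
# must be re-dispatched on every iteration.
x <- structure(double(3), class = "classA")
`[<-.classA` <- function(x, i, value) {
  y <- unclass(x); y[i] <- value
  structure(y, class = "classB")      # the assignment changes the class
}
`[<-.classB` <- function(x, i, value) {
  y <- unclass(x); y[i] <- value * 2  # a different method now applies
  structure(y, class = "classB")
}
x[1] <- 1   # dispatches to `[<-.classA`; x is now of class "classB"
x[2] <- 1   # must re-dispatch: `[<-.classB` runs and stores 2, not 1
```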

There are a large number of allocations going on (e.g. each iteration 
must create a length-one vector for x[i-1]), and

> gc.time(TRUE)
[1] 0 0 0 0 0
> system.time(f1(x, iA), gcFirst=TRUE)
[1] 6.46 0.02 6.49 0.00 0.00
> gc.time()
[1] 1.83 1.30 1.85 0.00 0.00

(although note the caveats in ?gc.time).  It is typical for gc()-ing to 
be a major component of the time in such simple tasks, and allocation 
can cost even more.

Note the use of gcFirst=TRUE (one could also call gc() just before 
gc.time to be even fairer).  Comparable figures for f2 are:

> system.time(f2(x, iA), gcFirst=TRUE)
[1] 2.13 0.00 2.13 0.00 0.00
> gc.time()
[1] 0.69 0.54 0.71 0.00 0.00

It's quite typical for gc()-ing to take 1/3 of the time (as reported by 
gc.time) in trivial problems like this.
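The measurement pattern used above can be put together as follows (a sketch; note that ?gc.time itself warns these figures are approximate, and a smaller n than the post's 1e6 is used here for speed):

```r
# Separate GC cost from total cost: enable/zero the GC timers, collect
# before timing, then compare elapsed time with time spent in gc().
f1 <- function(x, iA) for (i in iA) x[i] <- x[i - 1]
n  <- 1e5                      # smaller than the 1e6 in the thread
iA <- seq(2, n)
x  <- double(n)
gc.time(TRUE)                  # turn on GC timing (as in the figures above)
tt <- system.time(f1(x, iA), gcFirst = TRUE)  # gc() first, then time
gt <- gc.time()                # user/system/elapsed time spent in GC
```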

> Though the examples are very simple, I think they are typical of any
> simulation of dynamic systems where the new state is computed as a
> transformation of the previous state. In my example, the f1(), the state
> was a scalar and the transformation was the identity.

I don't believe they are typical, as the transformation is so trivial.
Of course, `simulation of dynamic systems' is not a typical task for R.
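That said, for the *linear* special case of such recursions base R does offer a vectorized route, which sidesteps the per-iteration x[i]<- cost entirely (a sketch using stats::filter with method = "recursive"; this is an editorial aside, not something proposed in the thread):

```r
# The linear recursion x[i] = a*x[i-1] + e[i] computed without an
# explicit loop, via the recursive filter in base R (package stats).
set.seed(1)
e <- rnorm(10)
a <- 0.5
x_filter <- as.numeric(filter(e, a, method = "recursive"))
# reference loop implementation, as in the examples above
x_loop <- numeric(10)
x_loop[1] <- e[1]
for (i in 2:10) x_loop[i] <- a * x_loop[i - 1] + e[i]
```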

> In any case the timing of the byte-compilation is remarkable!

> Thanks,
> Vadim
>
>
>
>> -----Original Message-----
>> From: Luke Tierney [mailto:luke at stat.uiowa.edu]
>> Sent: Tuesday, April 26, 2005 5:50 PM
>> To: Vadim Ogranovich
>> Cc: Peter Dalgaard; Jason Liao; r-devel at stat.math.ethz.ch
>> Subject: RE: [R] when can we expect Prof Tierney's compiled R?
>>
>> For what it's worth (probably not much as these simple
>> benchmarks are rarely representative of real code and so need
>> to be taken with a huge chunk of salt) here is what happens
>> with your examples in R 2.1.0 with the current byte compiler.
>>
>> Define your examples as functions:
>>
>>      n = 1e6; iA = seq(2,n); x = double(n);
>>      f1 <- function(x, iA) for (i in iA) x[i] = x[i-1]
>>      f2 <- function(x, iA) for (i in iA) x[i-1]
>>      f3 <- function(x, iA) for (i in iA) 1
>>      f4 <- function(x, iA) for (i in iA) x[i] = 1.0
>>      f5 <- function(x, iA) for (i in iA) i-1
>>      f6 <- function(x, iA) for (i in iA) i
>>
>> Make byte compiled versions:
>>
>>      f1c <- cmpfun(f1)
>>      f2c <- cmpfun(f2)
>>      f3c <- cmpfun(f3)
>>      f4c <- cmpfun(f4)
>>      f5c <- cmpfun(f5)
>>      f6c <- cmpfun(f6)
>>
>> and run them:
>>
>>     > system.time(f1(x, iA))
>>      [1] 5.43 0.04 5.56 0.00 0.00
>>     > system.time(f1c(x, iA))
>>      [1] 1.77 0.03 1.81 0.00 0.00
>>
>>     > system.time(f2(x, iA))
>>      [1] 1.72 0.01 1.74 0.00 0.00
>>     > system.time(f2c(x, iA))
>>      [1] 0.63 0.00 0.63 0.00 0.00
>>
>>     > system.time(f3(x, iA))
>>      [1] 0.19 0.00 0.20 0.00 0.00
>>     > system.time(f3c(x, iA))
>>      [1] 0.14 0.00 0.15 0.00 0.00
>>
>>     > system.time(f4(x, iA))
>>      [1] 3.78 0.03 3.82 0.00 0.00
>>     > system.time(f4c(x, iA))
>>      [1] 1.26 0.02 1.30 0.00 0.00
>>
>>     > system.time(f5(x, iA))
>>      [1] 0.99 0.00 1.00 0.00 0.00
>>     > system.time(f5c(x, iA))
>>      [1] 0.30 0.00 0.31 0.00 0.00
>>
>>     > system.time(f6(x, iA))
>>      [1] 0.21 0.00 0.23 0.00 0.00
>>     > system.time(f6c(x, iA))
>>      [1] 0.17 0.00 0.20 0.00 0.00
>>
>> I'll let you do the arithmetic.  The byte compiler does get
>> rid of a fair bit of interpreter overhead (which is large in
>> these kinds of examples compared to most real code) but there
>> is still considerable room for improvement.  The byte code
>> engine currently uses the same internal C code for doing the
>> actual work as the interpreter, so improvements there would
>> help both interpreted and byte compiled code.
>>
>> Best,
>>
>> luke
>>
>> On Fri, 22 Apr 2005, Vadim Ogranovich wrote:
>>
>>> If we are on the subject of byte compilation, let me bring a couple
>>> of examples which have been puzzling me for some time. I'd like to
>>> know a) if the compilation is likely to improve the performance for
>>> this type of computation, and b) to at least roughly understand the
>>> reasons for the observed numbers, specifically why x[i]<- assignment
>>> is so much slower than x[i] extraction.
>>>
>>> The loops below are typical of any recursive calculation where the
>>> new value of a vector is based on its immediate neighbour, say to
>>> the left.  Specifically we assign the previous value to the current
>>> element.
>>>
>>> # this shows that the assignment x[i]<- is the bottleneck in the loop
>>>> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) x[i] = x[i-1])
>>> [1] 4.29 0.00 4.30 0.00 0.00
>>>> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) x[i-1])
>>> [1] 1.46 0.01 1.46 0.00 0.00
>>>
>>> # the overhead of the loop itself is reasonably low, just 0.17 sec
>>>> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) 1)
>>> [1] 0.17 0.01 0.17 0.00 0.00
>>>
>>> # pure assignment (w/o the extraction x[i]) takes 3.09 sec.  Thus
>>> # x[i] as extraction is (3.09 - 0.17)/(0.79 - 0.17) = 4.7 times
>>> # faster than x[i]<- as assignment.  This looks a bit odd.
>>>> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) x[i] = 1.0)
>>> [1] 3.08 0.00 3.09 0.00 0.00
>>>
>>>
>>> # this shows that just evaluation of (i-1) takes about
>>> # (0.79 - 0.24) = 0.55 sec on my machine (AMD 64 bit). Looks too slow.
>>>> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) i-1)
>>> [1] 0.79 0.00 0.79 0.00 0.00
>>>> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) i)
>>> [1] 0.24 0.01 0.24 0.00 0.00
>>>
>>> Thanks,
>>> Vadim
>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at stat.math.ethz.ch
>>>> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Luke Tierney
>>>> Sent: Friday, April 22, 2005 7:33 AM
>>>> To: Peter Dalgaard
>>>> Cc: Jason Liao; r-help at stat.math.ethz.ch
>>>> Subject: Re: [R] when can we expect Prof Tierney's compiled R?
>>>>
>>>> On Wed, 20 Apr 2005, Peter Dalgaard wrote:
>>>>
>>>>> Luke Tierney <luke at stat.uiowa.edu> writes:
>>>>>
>>>>>> Vectorized operations in R are also as fast as compiled C (because
>>>>>> that is what they are :-)).  A compiler such as the one I'm working
>>>>>> on will be able to make most difference for non-vectorizable or not
>>>>>> very vectorizable code.  It may also be able to reduce the need for
>>>>>> intermediate allocations in vectorizable code, which may have other
>>>>>> benefits beyond just speed improvements.
>>>>>
>>>>> Actually, it has struck me a couple of times that these operations
>>>>> are not as fast as they could be, since they are outside the scope
>>>>> of fast BLAS routines, but "embarrassingly parallel" code could
>>>>> easily be written for the relevant hardware. Even on uniprocessor
>>>>> systems there might be speedups that the C compiler cannot find
>>>>> (e.g. because it cannot assume that source and destination of the
>>>>> operation are distinct).
>>>>
>>>> My guess is that for anything beyond basic operations we are doing
>>>> OK on uniprocessors, but it would be useful to do some testing to
>>>> be sure.  For the basic operations I suspect we are paying a heavy
>>>> price for the way we handle recycling, both in terms of overhead as
>>>> such and in terms of inhibiting compiler optimizations.  For
>>>> performance it would probably be better to code the scalar-vector,
>>>> equal-length-vector, and general cases separately, though keeping
>>>> the code maintainable may be a bit of a challenge.  Again, testing
>>>> on a range of platforms and compilers would be useful.
>>>>
>>>> With multiprocessors likely to become more widely available it
>>>> would be good to look into ways of factoring the vectorized math
>>>> code so we can slide in one that uses threads when appropriate.
>>>> This should dovetail nicely with compilation to identify larger
>>>> vectorized expressions that can be parallelized as a unit; I hope
>>>> to look into this a bit this summer.
>>>>
>>>> luke
>>>>
>>>>
>>>>
>>>> --
>>>> Luke Tierney
>>>> Chair, Statistics and Actuarial Science
>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>> University of Iowa                  Phone:             319-335-3386
>>>> Department of Statistics and        Fax:               319-335-3017
>>>>     Actuarial Science
>>>> 241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
>>>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>>>>
>>>> ______________________________________________
>>>> R-help at stat.math.ethz.ch mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide!
>>>> http://www.R-project.org/posting-guide.html
>>>>
>>>
>>>
>>
>> --
>> Luke Tierney
>> Chair, Statistics and Actuarial Science
>> Ralph E. Wareham Professor of Mathematical Sciences
>> University of Iowa                  Phone:             319-335-3386
>> Department of Statistics and        Fax:               319-335-3017
>>     Actuarial Science
>> 241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>>
>
> ______________________________________________
> R-devel at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


