[R] Need very fast application of 'diff' - ideas?

Sat Jan 28 18:46:48 CET 2012

On 28 January 2012 at 16:20, Hans W Borchers wrote:
| R. Michael Weylandt <michael.weylandt <at> gmail.com> writes:
| > 
| > I'd write your own diff() that eliminates the method dispatch and
| > argument checking that diff -> diff.default does.
| > 
| > x[-1] - x[-len(x)] # is all you really need.
| > (# you could also try something like c(x[-1], NA) - x which may be
| > marginally faster as it only subsets x once but you should profile to
| > find out)
| > 
| > is probably about as fast as you can get within pure R code (the
| > function overhead will add a little bit of time as well, so if speed
| > is truly the only thing that matters, best not to use it. If you wanna
| > go for even more speed, you'll have to go to compiled code; I'd
| > suggest inline+Rcpp as the easiest way to do so. That could get it
| > down to a single pass through the vector in pure C (or nice C++) which
| > seems to be a lower bound for speed.
| > 
| > Michael
| 
| 
| Python has become astonishingly fast during the last years. On an iMAc with
| 3.06 GHz I can see the following timings (though I do feel a bit suspicious 
| about the timings Python reports):
| 
|     Python      0.040 s     Version 2.6.1, 1e7 integer elements
|     Matlab      0.095 s     Matlab's diff function (Version R2011b)
|     Matlab      0.315 s     Matlab using x(2:N)-x(1:(N-1))
|     R 2.14.1    0.375 s     R's diff() function
|     R           0.365 s     R using x[-1]-x[-N]
|     R           0.270 s     R using c(x[-1],NA)-x)
|     R+Fortran   0.180 s     R function calling .Fortran
|     R+C         0.180 s     R function calling .C
| 
| where---as an example---the C code looks like:
| 
|     void diff(int *n, int *x, int *d)
|     { for (long i=0; i<*n-2; i++) d[i] = x[i+1] - x[i]; }

We looked at a number of these operations in the context of the Rcpp
benchmark for vector convolution.  If you really really care about the last
bit of speed you do a little better using pointer arithmetic rather than [] indexing.

Also, R always checks for NA which is costly.

| There appears to be a factor of 4 between R+compiled code and Python code.
| It is also interesting to see that in Matlab 'diff' is considerably faster
| than differencing vectors, while in R it is slower.
| 
| P. S.:  To make the comparison fair I have used the following Python call:
| 
|     python -m timeit -n 1 -r 1
|         -s 'import numpy' 
|         -s 'arr = numpy.random.randint(0, 1000, (10000000,1)).astype("int32")'
|         'diff = arr[1:] - arr[:-1]'
| 
| i.e., used 32-bit integers and included the indexing in the loop.

We should be able to close the gap from 40ms for Python to 180ms for R + C. I
suspect there is some room left.  Can you post your codes ?

Dirk

| > On Fri, Jan 27, 2012 at 7:15 PM, Kevin Ummel <kevinummel <at> gmail.com> wrote:
| > > Hi everyone,
| > >
| > > Speed is the key here.
| > >
| > > I need to find the difference between a vector and its one-period lag
| > > (i.e. the difference between each value and the subsequent one in the 
| > > vector). Let's say the vector contains 10 million random integers
| > > between 0 and 1,000. The solution vector will have 9,999,999 values,
| > > since their is no lag for the 1st observation.
| > >
| > > In R we have:
| > >
| > > #Set up input vector
| > > x = runif(n=10e6, min=0, max=1000)
| > > x = round(x)
| > >
| > > #Find one-period difference
| > > y = diff(x)
| > >
| > > Question is: How can I get the 'diff(x)' part as fast as absolutely
| > > possible? I queried some colleagues who work with other languages, and
| > > they provided equivalent solutions in Python and Clojure that, on their
| > > machines, appear to be potentially much faster
| > > (I've put the code below in case anyone is interested).
| > > However, they mentioned that the overhead in passing the data between 
| > > languages could kill any improvements. I don't have much experience 
| > > integrating other languages, so I'm hoping the community has some ideas
| > > about how to approach this particular problem...
| > >
| > > Many thanks,
| > > Kevin
| > >
| > > In iPython:
| > >
| > > In [3]: import numpy as np
| > > In [4]: arr = np.random.randint(0, 1000, (10000000,1)).astype("int16")
| > > In [5]: arr1 = arr[1:].view()
| > > In [6]: timeit arr2 = arr1 - arr[:-1]
| > > 10 loops, best of 3: 20.1 ms per loop
| > >
| > > In Clojure:
| > >
| > > (defn subtract-lag
| > >  [n]
| > >  (let [v (take n (repeatedly rand))]
| > >    (time (dorun (map - v (cons 0 v))))))
| >
| 
| ______________________________________________
| R-help at r-project.org mailing list
| https://stat.ethz.ch/mailman/listinfo/r-help
| PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
| and provide commented, minimal, self-contained, reproducible code.

-- 
"Outside of a dog, a book is a man's best friend. Inside of a dog, it is too
dark to read." -- Groucho Marx