[Rd] compiling R | multi-Opteron | BLAS source

Prof Brian Ripley ripley at stats.ox.ac.uk
Tue Aug 1 19:42:48 CEST 2006


The R-devel version of R provides a pluggable BLAS, which makes such tests 
fairly easy (although building the BLAS themselves is not).  On dual 
Opterons, using multiple threads is often not worthwhile and can be 
counter-productive (Doug Bates has found some dramatic examples, and you 
can see them in my timings below).

So timings for FC3, gcc 3.4.6, dual Opteron 252, 64-bit build of R. ACML 
3.5.0 is by far the easiest to install (on R-devel all you need to do is 
to link libacml.so to lib/libRblas.so) and pretty competitive, so that is 
what I normally use.
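The linking step for R-devel's pluggable BLAS amounts to replacing lib/libRblas.so with a symlink to libacml.so. A minimal sketch of the commands, shown here in a throwaway sandbox (the real paths, your R installation's lib/ directory and the installed libacml.so, are assumptions you will need to adapt):

```shell
# Stand-in directories so the sequence of commands can be seen end to end;
# on a real system use $R_HOME/lib and the actual ACML install location.
R_LIB=$(mktemp -d)                  # stand-in for $R_HOME/lib
touch "$R_LIB/libRblas.so"          # stand-in for R's reference BLAS
ACML_DIR=$(mktemp -d)               # stand-in for the ACML install directory
touch "$ACML_DIR/libacml.so"        # stand-in for the real libacml.so

cd "$R_LIB"
mv libRblas.so libRblas.so.orig     # keep the reference BLAS to switch back
ln -s "$ACML_DIR/libacml.so" libRblas.so   # R will now load ACML as its BLAS
ls -l libRblas.so
```

Keeping the original libRblas.so around means switching BLAS implementations is just a matter of repointing one symlink, with no rebuild of R.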

These timings are not very repeatable: they are reproducible only to within 
a few per cent, even after averaging quite a few runs.

set.seed(123)
X <- matrix(rnorm(1e6), 1000)
system.time(for(i in 1:25) X%*%X)
system.time(for(i in 1:25) solve(X))
system.time(for(i in 1:10) svd(X))
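The averaging over runs can be scripted; a sketch (not from the original post; the matrix here is smaller than in the benchmark above to keep it quick, and nreps is arbitrary):

```r
set.seed(123)
X <- matrix(rnorm(25e4), 500)   # smaller than the 1000x1000 benchmark matrix
nreps <- 5                      # arbitrary; more reps smooth out the noise
tm <- replicate(nreps, system.time(X %*% X)[1:3])
rowMeans(tm)                    # mean user, system and elapsed time per run
```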

internal BLAS (-O3)
> system.time(for(i in 1:25) X%*%X)
[1] 96.939  0.341 97.375  0.000  0.000
> system.time(for(i in 1:25) solve(X))
[1] 110.316   1.652 112.006   0.000   0.000
> system.time(for(i in 1:10) svd(X))
[1] 165.550   1.131 166.806   0.000   0.000

Goto 1.03, 1 thread
> system.time(for(i in 1:25) X%*%X)
[1] 12.949  0.191 13.143  0.000  0.000
> system.time(for(i in 1:25) solve(X))
[1] 23.201  1.449 24.652  0.000  0.000
> system.time(for(i in 1:10) svd(X))
[1] 43.318  1.016 44.361  0.000  0.000

Goto 1.03, dual CPU
> system.time(for(i in 1:25) X%*%X)
[1] 15.038  0.244  8.488  0.000  0.000
> system.time(for(i in 1:25) solve(X))
[1] 26.569  2.239 19.814  0.000  0.000
> system.time(for(i in 1:10) svd(X))
[1] 59.912  1.799 50.350  0.000  0.000

ACML 3.5.0 (single-threaded)
> system.time(for(i in 1:25) X%*%X)
[1] 13.794  0.368 14.164  0.000  0.000
> system.time(for(i in 1:25) solve(X))
[1] 22.990  1.695 24.710  0.000  0.000
> system.time(for(i in 1:10) svd(X))
[1] 48.267  1.373 49.662  0.000  0.000

ATLAS 3.6.0, single-threaded
> system.time(for(i in 1:25) X%*%X)
[1] 16.164  0.404 16.572  0.000  0.000
> system.time(for(i in 1:25) solve(X))
[1] 26.200  1.704 27.907  0.000  0.000
> system.time(for(i in 1:10) svd(X))
[1] 50.150  1.462 51.619  0.000  0.000

ATLAS 3.6.0, multi-threaded
> system.time(for(i in 1:25) X%*%X)
[1] 17.657  0.468  9.775  0.000  0.000
> system.time(for(i in 1:25) solve(X))
[1] 38.388  2.353 30.141  0.000  0.000
> system.time(for(i in 1:10) svd(X))
[1] 95.611  3.039 88.917  0.000  0.000


On Sun, 23 Jul 2006, Evan Cooch wrote:

> Greetings -
> 
> A quick perusal of some of the posts to this maillist suggests the level 
> of the questions is probably beyond someone working at my level, but at 
> the risk of looking foolish publicly (something I find I get 
> increasingly comfortable with as I get older), here goes:
> 
> My research group recently purchased a multi-Opteron system (bunch of 
> 880 chips), running 64-bit RHEL 4 (which we have site licensed at our 
> university, so it cost us nothing - good price) with SMP support built 
> into the kernel (perhaps obviously, for a multi-pro system). Several of 
> our users use [R], which I've only used on a few occasions. However, it 
> is part of my task to get [R] installed for folks using this system.
> 
> While the simple, basic compile sequence (./configure, make, make check, 
> make install) went smoothly, it's pretty clear from our benchmarks that 
> the [R] code isn't running as 'rocket-fast' as it should for a system 
> like this. So, I dig a bit deeper. Most of the jobs we want to run could 
> benefit from BLAS support (lots of array manipulations and other bits of 
> linear algebra), and a few other compilation optimizations - and here is 
> where I seek advice.
> 
> 1) Looks like there are 3-4 flavours: LAPACK, ATLAS, ACML 
> (AMD-chips...), and Goto. In reading what I can find, it seems that 
> there are reasons not to use ACML (single-thread) despite the AMD chips, 
> reasons to avoid ATLAS (some hassles compiling on RHEL 4 boxes), reasons 
> to avoid LAPACK (ibid), but apparently no problems with Goto BLAS.
> 
> Is that a reasonable summary? At the risk of starting a larger 
> discussion, I'm simply looking to get BLAS support, yielding the fastest 
> [R] code with the minimum of hassles (while tweaking lines of configure 
> files, weird linker sequences and all that used to appeal when I was a 
> student, I don't have time at this stage). So, any quick recommendation 
> for *which* BLAS library? My quick assessment suggests goto BLAS, but 
> I'm hoping for some confirmation.
> 
> 2) compilation of BLAS - I can compile for 32-bit, or 64-bit. 
> Presumably, given we've invested in 64-bit chips, and a 64-bit OS, we'd 
> like to consider a 64-bit compilation. Which, also presumably, means 
> we'd need 64-bit compilation for [R]. While I've read the short blurb on 
> CRAN concerning 64-bit vs 32-bit compilations (data size vs speed), I'd 
> be happy to have both on our machine. But, I'm not sure how one 
> specifies 64 bits in the [R] compilation - what flags do I need to set 
> during ./configure, or which config file do I need to edit?
> 
> Thanks very much in advance - and, again, apologies for the 'low-level' 
> of these questions, but one needs to start somewhere.
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
