[R] Using OpenBLAS with R

Michael Hannon jmhannon.ucdavis at gmail.com
Mon Nov 17 09:37:58 CET 2014


Useful and interesting.  Thanks for your prompt reply.

-- Mike

On Sun, Nov 16, 2014 at 2:29 AM, Prof Brian Ripley
<ripley at stats.ox.ac.uk> wrote:
> On 16/11/2014 00:11, Michael Hannon wrote:
>>
>> Greetings.  I'd like to get some advice about using OpenBLAS with R,
>> rather
>> than using the BLAS that comes built in to R.
>
>
> That was really a topic for the R-devel list: see the posting guide.
>
>> I've tried this on my Fedora 20 system (see the appended for details).  I
>> ran
>> a simple test -- multiplying two large matrices -- and the results were
>> very
>> impressive, i.e., in favor of OpenBLAS, which is consistent with
>> discussions
>> I've seen on the web.
>
>
> If that is all you do, then you should be using an optimized BLAS, and
> choose the one(s) best for your (unstated) machine(s).
>
>> My concern is that maybe this is too good to be true.  I.e., the standard
>> R
>> configuration is vetted by thousands of people every day.  Can I have the
>> same
>> degree of confidence with OpenBLAS that I have in the built-in version?
>
>
> No.  And it is 'too good to be true' for most users of R, for whom BLAS
> operations take a negligible proportion of their CPU time.
>
>> And/or are there other caveats to using OpenBLAS of which I should be
>> aware?
>
>
> Yes: see the 'R Installation and Administration Manual'.  Known issues
> include:
>
> 1) Optimized BLAS trade accuracy for speed.  Surprisingly, much published R
> code relies on using extended-precision FPU registers for intermediate
> results, something optimized BLAS do far less often than the reference BLAS.
>
> Some packages rely on a particular sign of the solution to svd or eigen
> problems: people then report as bugs that optimized BLAS give a different
> sign from the reference BLAS.
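[The sign indeterminacy is easy to demonstrate: if v is an eigenvector then so is -v, so two BLAS/LAPACK builds can legitimately disagree on sign. A small sketch, assuming Rscript is on the PATH:

```shell
# Both v and -v satisfy A v = lambda v, so different BLAS/LAPACK builds
# may return either sign; code must not rely on a particular one.
# Assumes Rscript is available on the PATH.
Rscript -e 'A <- matrix(c(2, 1, 1, 2), 2); e <- eigen(A)
            v <- e$vectors[, 1]; lam <- e$values[1]
            print(all.equal(as.vector(A %*% v), lam * v))
            print(all.equal(as.vector(A %*% (-v)), lam * (-v)))'
```

Both checks succeed, which is why tests comparing eigenvectors across BLAS implementations should compare up to sign. -Ed.]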
>
> 2) Fast BLAS normally use multi-threading: that usually helps elapsed time
> for a single R task at the expense of increased total CPU time. Fine if you
> have unused CPU cores, but not advantageous in a fully-used multi-core
> machine, e.g. one that is doing many R sessions in parallel.
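[On a shared machine the thread count can be capped before R starts. A sketch of the usual environment settings; exact behaviour depends on how the OpenBLAS library was built, and `benchmark.R` is a hypothetical script name:

```shell
# Cap BLAS threading before launching R on a fully-used machine.
# OPENBLAS_NUM_THREADS controls pthreads builds of OpenBLAS;
# OMP_NUM_THREADS controls OpenMP builds.
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1
R --no-save < benchmark.R
```

Setting both covers either build variant. -Ed.]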
>
> 3) Many BLAS optimize their use of CPU caches.  This works best if the
> BLAS-using process is the only task running on a particular core (or CPU
> where CPU cores share cache).  (It also means that optimizing on one CPU
> model and running on another can be disastrous.)
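[On Linux the cache-locality point can be acted on by pinning the R process to specific cores. A sketch using `taskset` from util-linux; the core numbers are illustrative:

```shell
# Pin an R session to cores 0-3 so the BLAS keeps its CPU-cache
# working set on those cores (Linux-specific; cores are illustrative).
taskset -c 0-3 R --no-save
```

This matters most when several BLAS-heavy jobs would otherwise migrate across cores and evict each other's cache lines. -Ed.]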
>
>
>>
>> Thanks.
>>
>> -- Mike
>>
>> #### Here's the version of R, compiled locally with configuration options:
>> #### ./configure --enable-R-shlib --enable-BLAS-shlib
>>
>> $ R
>>
>> R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
>> Copyright (C) 2014 The R Foundation for Statistical Computing
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> .
>> .
>> .
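[When R is configured with --enable-BLAS-shlib, its BLAS lives in a replaceable shared library, which is one common way to do the swap described above. A sketch; the paths are illustrative for a local build on Fedora and will differ elsewhere:

```shell
# R built with --enable-BLAS-shlib puts its BLAS in libRblas.so,
# which can be replaced by a symlink.  Paths are illustrative.
cd /usr/local/lib64/R/lib
mv libRblas.so libRblas.so.keep                # keep the reference BLAS
ln -s /usr/lib64/libopenblas.so libRblas.so    # point R at OpenBLAS
```

Restoring the original file undoes the swap. -Ed.]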
>>
>> #### Here's the R source code for this little test:
>>
>> library(microbenchmark)
>>
>> mSize <- 10000
>> set.seed(42)
>>
>> aMat <- matrix(rnorm(mSize * mSize), nrow=mSize)
>> bMat <- matrix(rnorm(mSize * mSize), nrow=mSize)
>>
>> cMat <- aMat %*% bMat  ## do the calculation once to see that it works
>>
>> traceCMat <- sum(diag(cMat))  ## a mild sanity check on the calculation
>> traceCMat
>>
>> microbenchmark(aMat %*% bMat, times=5L)  ## repeat a few more times
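[Before trusting the timings it is worth confirming which BLAS the R process actually loads. A sketch of two ways to check; paths are illustrative:

```shell
# Which library does R's BLAS stub resolve to?  (path illustrative)
ldd /usr/local/lib64/R/lib/libRblas.so

# Or inspect a running R session's memory map, given its PID:
grep -i blas /proc/"$PID"/maps
```

If the symlink swap did not take effect, the reference libRblas.so will still appear here. -Ed.]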
>>
>> -----
>>
>> #### Here is the output from code, running under various conditions:
>>
>>> traceCMat ###### Using the built-in BLAS from R
>>
>> [1] -11367.55
>>>
>>> microbenchmark(aMat %*% bMat, times=5L)
>>
>> Unit: seconds
>>            expr      min       lq     mean   median       uq     max neval
>>   aMat %*% bMat 675.0064 675.5325 675.4897 675.5857 675.6618 675.662     5
>>
>> ----------
>>
>>> traceCMat  ###### Using libopenblas.so from Fedora
>>
>> [1] -11367.55
>>>
>>> microbenchmark(aMat %*% bMat, times=5L)
>>
>> Unit: seconds
>>            expr      min       lq     mean   median       uq      max neval
>>   aMat %*% bMat 70.67843 70.70545 70.76365 70.73026 70.83935 70.86475     5
>>>
>>>
>>
>> ----------
>>
>>> traceCMat <- sum(diag(cMat))  ###### libopenblas.so from Fedora with
>>> traceCMat                     ###### export OMP_NUM_THREADS=6
>>
>> [1] -11367.55
>>>
>>> microbenchmark(aMat %*% bMat, times=5L)
>>
>> Unit: seconds
>>            expr      min       lq    mean   median       uq      max neval
>>   aMat %*% bMat 69.99146 70.02426 70.3466 70.08327 70.39537 71.23866     5
>>>
>>>
>>
>> ###### Fedora libopenblas.so appears to be single-threaded
>>
>> ----------
>>
>>> traceCMat <- sum(diag(cMat))  ###### libopenblas.so compiled locally
>>> traceCMat                     ###### from source w/OMP_NUM_THREADS=6
>>
>> [1] -11367.55
>>>
>>> microbenchmark(aMat %*% bMat, times=5L)
>>
>> Unit: seconds
>>            expr      min       lq     mean   median       uq      max neval
>>   aMat %*% bMat 26.77385 27.10434 27.17862 27.12485 27.16301 27.72705     5
>>>
>>>
>>
>> ###### Locally-compiled openblas appears to be multi-threaded
>> ###### The microbenchmark appeared to use all 8 processors, even
>> ###### though I asked for only 6.
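[Using all 8 cores despite OMP_NUM_THREADS=6 is consistent with a pthreads (non-OpenMP) build of OpenBLAS, which ignores OMP_NUM_THREADS and falls back to its compile-time default. A sketch of the variable such builds typically honour instead (behaviour depends on the build):

```shell
# A pthreads build of OpenBLAS ignores OMP_NUM_THREADS; it reads
# OPENBLAS_NUM_THREADS instead (behaviour depends on the build).
export OPENBLAS_NUM_THREADS=6
R --no-save
```

-Ed.]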
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Emeritus Professor of Applied Statistics, University of Oxford
> 1 South Parks Road, Oxford OX1 3TG, UK


