[Rd] Peculiar timing result

Douglas Bates bates at stat.wisc.edu
Tue Mar 14 02:10:06 CET 2006


On 3/11/06, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
> Here is a summary of some results on a dual Opteron 252 running FC3
>
>                           user  system elapsed
> 64-bit gcc 3.4.5
> R's blas                 34.83    3.45   38.56
> ATLAS                    36.70    3.28   40.14
> ATLAS multithread        76.85    5.39   82.29
> Goto 1 thread            36.17    3.44   39.76
> Goto multithread        178.06  345.97  467.99
> ACML                     49.69    3.36   53.23
>
> 64-bit gcc 4.1.0
> R's blas                 34.98    3.49   38.55
>
> 32-bit gcc 3.4.5
> R's blas                 33.72    3.27   36.99
>
> 32-bit gcc 4.1.0
> R's blas                 34.62    3.25   37.93
>
> The timings are not that repeatable, but the message seems clear that
> this problem does not benefit from a tuned BLAS and the overhead from
> multiple threads is harmful.  (The gcc 4.1.0 results took fewer
> iterations, which skews the results in its favour.)
>
> And my 2GHz Pentium M laptop under Windows gave 39.96  3.68 44.06.
>
> Clearly the Goto BLAS has a problem here: the results are slower on a dual
> 252 than a dual 248 (see below).

Thanks for the information on the timings.  It happens that this
message ended up in my R-help folder and I only got around to reading
that folder today.

Is it OK with you if I forward this message to Simon Urbanek?  I am
seeing similar timing difficulties with R on a dual-core Intel
MacBook.
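
For anyone comparing such runs, here is a minimal sketch in R of how the
64-bit gcc 3.4.5 figures quoted above could be summarised.  It assumes the
three columns are the user, system and elapsed times from system.time; the
object names (timings, sys_ratio, slowdown) are illustrative only.

```r
# Timings in seconds, copied from the 64-bit gcc 3.4.5 results quoted above.
# Assumption: columns are user, system and elapsed time from system.time().
timings <- rbind(
  "R's blas"          = c(34.83,   3.45,  38.56),
  "ATLAS"             = c(36.70,   3.28,  40.14),
  "ATLAS multithread" = c(76.85,   5.39,  82.29),
  "Goto 1 thread"     = c(36.17,   3.44,  39.76),
  "Goto multithread"  = c(178.06, 345.97, 467.99),
  "ACML"              = c(49.69,   3.36,  53.23)
)
colnames(timings) <- c("user", "system", "elapsed")

# System time relative to user time: a large ratio points to time spent
# in the kernel (e.g. thread management) rather than in the computation.
sys_ratio <- timings[, "system"] / timings[, "user"]

# Elapsed time relative to R's reference BLAS.
slowdown <- timings[, "elapsed"] / timings["R's blas", "elapsed"]

print(round(cbind(timings, sys_ratio, slowdown), 2))
```

A system-to-user ratio far above that of the single-threaded runs is the
pattern described in the quoted message as typical of threading overhead.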
>
>
> On Fri, 3 Mar 2006, Prof Brian Ripley wrote:
>
> > On Fri, 3 Mar 2006, Douglas Bates wrote:
> >
> >> I have been timing a particular model fit using lmer on several
> >> different computers and came up with a peculiar result - the model fit
> >> is considerably slower on a dual-core Athlon 64 using Goto's
> >> multithreaded BLAS than on a single-core processor.
> >
> > Is there a Goto BLAS tuned for that chip?  I can only see one tuned for an
> > (unspecified) Opteron.  L1 and L2 cache sizes do sometimes matter a lot
> > for tuned BLAS, and (according to the AMD site I just looked up) the X2
> > 3800+ only has a 512 KB per-core L2 cache.  Opterons have a 1 MB L2 cache.
> >
> > Also, the very large system time seen in the dual-core run is typical of
> > what I see when pthreads is not working right, and I suggest you try a
> > limit of one thread (see the R-admin manual).  On our dual-processor
> > Opteron 248 that ran in 44 secs instead of 328.
> >
> >> Here is the timing on a single-core Athlon 64 3000+ running under
> >> today's R-devel with version 0.995-5 of the Matrix package.
> >>
> >>> library(Matrix)
> >>> data(star, package = 'mlmRev')
> >>> system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star,
> >>+                         control = list(nit=0,grad=0,msV=1)))
> >> [1] 43.10  3.78 48.41  0.00  0.00
> >>
> >>
> >> (If you run the timing yourself and don't want to see the iteration
> >> output, take the msV=1 out of the control list.  I keep it in there so
> >> I can monitor the progress.)
> >>
> >> If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
> >> the same version of R, BLAS and Matrix package, the timing ends up
> >> with something like
> >>
> >> 90 140 235 0 0
> > ....
> >
> >
>
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>


