[Rd] Peculiar timing result

Fri Mar 3 16:45:11 CET 2006

I have been timing a particular model fit using lmer on several
different computers and came up with a peculiar result: the model fit
is considerably slower on a dual-core Athlon 64 using Goto's
multithreaded BLAS than on a single-core processor.

Here is the timing on a single-core Athlon 64 3000+ running under
today's R-devel with version 0.995-5 of the Matrix package.

> library(Matrix)
> data(star, package = 'mlmRev')
> system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star, control = list(nit=0,grad=0,msV=1)))
  0      241720.:  1.16440 0.335239  0.00000  1.78732 0.867209 0.382318  0.00000
  1      239722.:  1.94952 5.00000e-10 0.00933767  1.65999 0.858003 0.341520 0.00908757
  2      239580.:  1.95924 0.0884059 0.00933767  1.65308 0.857487 0.339296 0.00954718
  3      239215.:  2.60877 0.0765848 0.0177699  1.45739 0.843562 0.275100 0.0236849
  4      239204.:  2.62582 0.106670 0.0239698  1.41976 0.841086 0.261033 0.0267073
  5      239176.:  2.63149 0.0787924 0.0367185  1.37952 0.838527 0.245076 0.0301134
  6      239141.:  2.64949 0.108534 0.0594586  1.28846 0.832543 0.208404 0.0375456
  7      239049.:  2.64794 0.0789214 0.121782  1.10436 0.819711 0.126101 0.0524965
  8      239004.:  2.66044 0.117957 0.181505 0.932202 0.798982 0.0718116 0.0628958
  9      238944.:  2.66310 0.0819653 0.334477 0.631735 0.740855 0.258359 0.0806590
 10      238893.:  2.72626 0.0975205 0.653432 0.703912 0.666067 0.109922 0.201809
 11      238892.:  2.74381 0.111146 0.666155 0.693544 0.662000 0.104060 0.207591
 12      238888.:  2.75052 0.0990238 0.689199 0.694588 0.655781 0.106516 0.216460
 13      238861.:  2.80325 0.126935  1.05912 0.733914 0.556162 0.159296 0.360938
 14      238832.:  2.82656 0.117617  1.59471 0.607916 0.441371 0.0749944 0.976142
 15      238811.:  2.87350 0.136332  1.59046 0.653141 0.353763 0.226061  1.79285
 16      238810.:  2.87663 0.125135  1.58992 0.656808 0.352605 0.220488  1.79282
 17      238806.:  2.89342 0.141551  1.58607 0.676523 0.344212 0.181833  1.79268
 18      238804.:  2.90080 0.125137  1.56624 0.682921 0.261295 0.180598  1.74325
 19      238802.:  2.91950 0.128569  1.56836 0.680436 0.336051 0.159940  1.80400
 20      238801.:  2.92795 0.134762  1.56597 0.685121 0.331695 0.145547  1.80414
 21      238801.:  2.93741 0.125667  1.56139 0.687827 0.332700 0.138854  1.81495
 22      238800.:  2.94588 0.131757  1.55294 0.687909 0.330414 0.137834  1.82826
 23      238799.:  2.96867 0.129716  1.52943 0.688678 0.323171 0.139912  1.84615
 24      238799.:  2.98994 0.133378  1.52188 0.700038 0.337387 0.131403  1.77623
 25      238799.:  3.00312 0.135308  1.51475 0.697550 0.311750 0.145683  1.78037
 26      238799.:  3.00461 0.129920  1.51083 0.697666 0.306722 0.138745  1.81188
 27      238799.:  3.00504 0.134882  1.50539 0.696745 0.302949 0.135897  1.84405
 28      238799.:  3.00422 0.134013  1.47947 0.698115 0.303243 0.133806  1.86486
 29      238799.:  3.00819 0.134378  1.48185 0.701871 0.307097 0.134637  1.84996
 30      238799.:  3.01313 0.134279  1.49098 0.702883 0.304788 0.133682  1.86254
 31      238799.:  3.01291 0.134253  1.49060 0.701818 0.303155 0.133771  1.84613
 32      238799.:  3.01142 0.134314  1.48921 0.701782 0.303589 0.134439  1.84653
 33      238799.:  3.01174 0.134315  1.48926 0.701641 0.304120 0.134145  1.84635
 34      238799.:  3.01175 0.134304  1.48942 0.701740 0.303762 0.134185  1.84649
 35      238799.:  3.01173 0.134307  1.48937 0.701724 0.303809 0.134206  1.84647
[1] 43.10  3.78 48.41  0.00  0.00

(If you run the timing yourself and don't want to see the iteration
output, take the msV=1 out of the control list.  I keep it in there so
I can monitor the progress.)

If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
the same versions of R, the BLAS and the Matrix package, the timing
comes out as something like

90 140 235 0 0

I do see that the multithreading is working on a calculation that is
essentially BLAS-bound such as

> mm <- Matrix(rnorm(1e6), nc = 1e3)
> system.time(crossprod(mm))
[1] 0.57 0.02 0.60 0.00 0.00

On the X2 processor it still takes about 0.6 seconds user time but
only 0.3 seconds elapsed time when the machine is otherwise idle and
both cores are available for the calculation.
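
A minimal sketch of the same check using only base R, for anyone who
wants to see whether their BLAS is using more than one core.  It
assumes a plain matrix() instead of Matrix(), and the name
cpu_over_elapsed is mine, chosen only for illustration.

```r
# Sketch: compare CPU time with elapsed (wall-clock) time for a
# BLAS-bound operation.  Uses a base R matrix rather than Matrix().
set.seed(1)                          # reproducible input
m <- matrix(rnorm(1e6), ncol = 1e3)  # 1000 x 1000 dense matrix

tm <- system.time(xtx <- crossprod(m))  # computes t(m) %*% m

# Ratio of CPU time (user + system) to elapsed time.  A value well
# above 1 suggests that more than one core was busy during the call;
# how the CPU time is reported can depend on the OS and the BLAS.
cpu_over_elapsed <- unname((tm["user.self"] + tm["sys.self"]) / tm["elapsed"])

print(tm)
print(cpu_over_elapsed)
```

On an otherwise idle machine, repeating the call a few times and
comparing the ratios gives a rough idea of how stable the figure is.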

Any suggestions why the dual-core processor is so much slower than the
single-core processor?

By the way, I would be interested in accumulating timings of this
model fit on other systems.  If you do time it, please send me
(off-list) a summary of the version of R, the version of the
accelerated BLAS if you use one, processor speed and configuration
(e.g. multiprocessor, multicore) and, if you know it, memory speed.
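
One possible way to gather part of that summary from within R is
sketched here.  timing_report is a hypothetical helper, not part of R
or of the Matrix package.  As far as I know base R has no portable way
to report the BLAS version, processor speed or memory speed, so those
would still have to be added by hand.

```r
# Sketch: collect the R-side details into one list.
# `timing_report` is a hypothetical helper, not part of R or Matrix.
timing_report <- function(timing = NULL) {
  info <- Sys.info()  # OS and machine details, where available
  list(
    r_version      = R.version.string,
    # NA if the Matrix package is not installed
    matrix_version = tryCatch(
      as.character(utils::packageVersion("Matrix")),
      error = function(e) NA_character_
    ),
    platform       = R.version$platform,
    machine        = unname(info["machine"]),
    timing         = timing  # e.g. the result of system.time(...)
  )
}

# Usage: pass in the timing of whatever was measured.
str(timing_report(system.time(Sys.sleep(0.01))))
```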

This is an example of a complex multilevel model with crossed grouping
factors fit to a relatively large (30000 observations on 10000
students, 1400 teachers and 80 schools) data set.