[Rd] Possible page inefficiency in do_matrix in array.c

Tue Sep 4 14:07:32 CEST 2012

> Actually, my apologies, I was assuming that your example was based on the
> SO question while it is not at all (the code is not involved in that test
> case). Reversing the order does indeed cause a delay. Switching to a
> single index doesn't seem to have any impact. R-devel has the faster
> version now (which now also works with large vectors).
>
> Cheers,
> Simon

I was intrigued why the compiler doesn't swap the loops when you thought
it should, though. You're not usually wrong! From GCC's documentation (end
of last paragraph is the most significant) :

====

-floop-interchange
Perform loop interchange transformations on loops. Interchanging two
nested loops switches the inner and outer loops. For example, given a loop
like:
          DO J = 1, M
            DO I = 1, N
              A(J, I) = A(J, I) * C
            ENDDO
          ENDDO

loop interchange transforms the loop as if it were written:

          DO I = 1, N
            DO J = 1, M
              A(J, I) = A(J, I) * C
            ENDDO
          ENDDO

which can be beneficial when N is larger than the caches, because in
Fortran, the elements of an array are stored in memory contiguously by
column, and the original loop iterates over rows, potentially creating at
each access a cache miss. This optimization applies to all the languages
supported by GCC and is not limited to Fortran. To use this code
transformation, GCC has to be configured with --with-ppl and --with-cloog
to enable the Graphite loop transformation infrastructure.

====

Could R build scripts be configured to set these gcc flags to turn on
"Graphite", then? I guess one downside could be the time to compile.

Matthew

>
> On Sep 2, 2012, at 10:32 PM, Simon Urbanek wrote:
>
>> On Sep 2, 2012, at 10:04 PM, Matthew Dowle wrote:
>>
>>>
>>> In do_matrix in src/array.c there is a type switch containing :
>>>
>>> case LGLSXP :
>>>   for (i = 0; i < nr; i++)
>>>   for (j = 0; j < nc; j++)
>>>       LOGICAL(ans)[i + j * NR] = NA_LOGICAL;
>>>
>>> That seems page inefficient, iiuc. Think it should be :
>>>
>>> case LGLSXP :
>>>   for (j = 0; j < nc; j++)
>>>   for (i = 0; i < nr; i++)
>>>       LOGICAL(ans)[i + j * NR] = NA_LOGICAL;
>>>
>>> or more simply :
>>>
>>> case LGLSXP :
>>>   for (i = 0; i < nc*nr; i++)
>>>       LOGICAL(ans)[i] = NA_LOGICAL;
>>>
>>> ( with some fine tuning required since NR is type R_xlen_t whilst i, nc
>>> and nr are type int ).
>>>
>>> Same goes for all the other types in that switch.
>>>
>>> This came up on Stack Overflow here :
>>> http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r
>>>
>>
>> That is completely irrelevant - modern compilers will optimize the loops
>> accordingly and there is no difference in speed. If you don't believe
>> it, run benchmarks ;)
>>
>> original
>>> microbenchmark(matrix(nrow=10000, ncol=9999), times=10)
>> Unit: milliseconds
>>                               expr      min       lq  median       uq
>>   max
>> 1 matrix(nrow = 10000, ncol = 9999) 940.5519 940.6644 941.136 954.7196
>> 1409.901
>>
>>
>> swapped
>>> microbenchmark(matrix(nrow=10000, ncol=9999), times=10)
>> Unit: milliseconds
>>                               expr      min       lq   median      uq
>>   max
>> 1 matrix(nrow = 10000, ncol = 9999) 949.9638 950.6642 952.7497 961.001
>> 1246.573
>>
>> Cheers,
>> Simon
>>
>>
>>> Matthew
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>>
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>
>
>