[Rd] How x[, 'colname1'] is implemented?

Seth Falcon seth at userprimary.net
Sat Jan 2 01:56:02 CET 2010


On 1/1/10 1:40 PM, Peng Yu wrote:
> On Fri, Jan 1, 2010 at 6:52 AM, Barry Rowlingson
> <b.rowlingson at lancaster.ac.uk> wrote:
>> On Thu, Dec 31, 2009 at 11:27 PM, Peng Yu <pengyu.ut at gmail.com> wrote:
>>> I don't see where describes the implementation of '[]'.
>>>
>>> For example, if x is a matrix or a data.frame, how the lookup of
>>> 'colname1' is x[, 'colname1'] executed. Does R perform a lookup in the
>>> a hash of the colnames? Is the reference O(1) or O(n), where n is the
>>> second dim of x?
>>
>>  Where have you looked? I doubt this kind of implementation detail is
>> in the .Rd documentation since a regular user doesn't care for it.
> 
> I'm not complaining that it is not documented.
> 
>>  As Obi-wan Kenobi may have said in Star Wars: "Use the source, Luke!":
>>
>>  Line 450 of subscript.c of the source code of R 2.10 is the
>> stringSubscript function. It has this comment:
>>
>> /* The original code (pre 2.0.0) used a ns x nx loop that was too
>>  * slow.  So now we hash.  Hashing is expensive on memory (up to 32nx
>>  * bytes) so it is only worth doing if ns * nx is large.  If nx is
>>  * large, then it will be too slow unless ns is very small.
>>  */
> 
> Could you explain what ns and nx represent?

integers :-)

Consider a 5x5 matrix m and a call like m[ , c("C", "D")], then
in the call to stringSubscript:

  s - The character vector of subscripts, here c("C", "D")

  ns - length of s, here 2

  nx - length of the dimension being subscripted, here 5

  names - the dimnames being subscripted.  Here, perhaps
          c("A", "B", "C", "D", "E")

>> The definition of "large" and "small" here appears to be such that:
>>
>> 457: Rboolean usehashing = in && ( ((ns > 1000 && nx) || (nx > 1000 &&
>> ns)) || (ns * nx > 15*nx + ns) );

The 'in' argument is always TRUE AFAICS so this boils down to:

Use hashing for x[i] if either length(x) > 1000 or length(i) > 1000 (and
we aren't in the trivial case where either length(x) == 0 or length(i) == 0)

OR use hashing if (ns * nx > 15*nx + ns)


+ seth



More information about the R-devel mailing list