[Rd] shash in unique.c

Matthew Dowle mdowle at mdowle.plus.com
Fri Mar 5 13:40:42 CET 2010

Thanks a lot.  Quick and brief responses below...

"Duncan Murdoch" <murdoch at stats.uwo.ca> wrote in message 
news:4B90F134.6070004 at stats.uwo.ca...
> Matthew Dowle wrote:
>> I was hoping for a 'yes', 'no', 'maybe' or 'bad idea because ...'. No 
>> response resulted in a retry() after a Sys.sleep(10 days).
>> If its a "yes" or "maybe" then I could proceed to try it, test it, and 
>> present the test results and timings to you along with the patch.  It 
>> would be on 32bit Ubuntu first, and I would need to either buy, rent time 
>> on, or borrow a 64bit machine to be able to then test there, owing to the 
>> nature of the suggestion.
>> If its "no", "bad idea because..." or "we were already working on it, or 
>> better",  then I won't spend any more time on it.
>> Matthew
>> "Matthew Dowle" <mdowle at mdowle.plus.com> wrote in message 
>> news:hlu4qh$l7s$1 at dough.gmane.org...
>>> Looking at shash in unique.c, from R-2.10.1  I'm wondering if it makes 
>>> sense to hash the pointer itself rather than the string it points to?
>>> In other words could the SEXP pointer be cast to unsigned int and the 
>>> usual scatter be called on that as if it were integer?
> Two negative but probably not fatal issues:
> Pointers and ints are not always the same size.  In Win64, ints are 32 
> bits, pointers are 64 bits.  (Can we be sure there is some integer type 
> the same size as a pointer?  I don't know, ask a C expert.)
No we can't be sure. But we could test at runtime, and if the assumption 
wasn't true, then revert to the existing method.

> We might want to save the hash to disk.  On restore, the pointer based 
> hash would be all wrong.  (I don't know if we actually do ever save a hash 
> to disk.  )
The hash table in unique.c appears to be a temporary private hash, different 
to the global R_StringHash. Its private hash appears to be used only while 
the call to unique runs, then free'd. Thats my understanding anyway. The 
suggestion is not to alter the global R_StringHash in any way at all,  which 
is the one that might be saved to disk now or in the future.

> Duncan Murdoch
>>> shash would look like a slightly modified version of ihash like this :
>>> static int shash(SEXP x, int indx, HashData *d)
>>> {
>>>    if (STRING_ELT(x,indx) == NA_STRING) return 0;
>>>    return scatter((unsigned int) (STRING_ELT(x,indx), d);
>>> }
>>> rather than its current form which appears to hash the string it points 
>>> to :
>>> static int shash(SEXP x, int indx, HashData *d)
>>> {
>>>    unsigned int k;
>>>    const char *p;
>>>    if(d->useUTF8)
>>> p = translateCharUTF8(STRING_ELT(x, indx));
>>>    else
>>> p = translateChar(STRING_ELT(x, indx));
>>>    k = 0;
>>>    while (*p++)
>>>     k = 11 * k + *p; /* was 8 but 11 isn't a power of 2 */
>>>    return scatter(k, d);
>>> }
>>> Looking at sequal, below, and reading its comments, if the pointers are 
>>> equal it doesn't look at the strings they point to, which lead to the 
>>> question above.
>>> static int sequal(SEXP x, int i, SEXP y, int j)
>>> {
>>>    if (i < 0 || j < 0) return 0;
>>>    /* Two strings which have the same address must be the same,
>>>       so avoid looking at the contents */
>>>    if (STRING_ELT(x, i) == STRING_ELT(y, j)) return 1;
>>>    /* Then if either is NA the other cannot be */
>>>    /* Once all CHARSXPs are cached, Seql will handle this */
>>>    if (STRING_ELT(x, i) == NA_STRING || STRING_ELT(y, j) == NA_STRING)
>>> return 0;
>>>    return Seql(STRING_ELT(x, i), STRING_ELT(y, j));
>>> }
>>> Matthew
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

More information about the R-devel mailing list