[Rd] dict package: dictionary data structure for R

Sun Jul 22 04:40:44 CEST 2007

"Gabor Grothendieck" <ggrothendieck at gmail.com> writes:

> Although the proto package is not particularly aimed at hashing note
> that it covers some of the same ground and also is based on a well
> thought out object model (known as object-based programming
> or prototype programming).

Interesting.  The dict package differs from proto in that it _is_
aimed at hashing and:

  - It is S4 based

  - It does not use R's environment objects to implement its
    hashtables (proto uses environments).

In Bioconductor, we have many hashtables where the key is an
Affymetrix probeset ID.  These look sort of like "1000_at".  It turns
out that the algorithm used by R's environments is not very good at
hashing these values.  The dict package lets you investigate this:

   library("dict")
   keys2 = paste(seq(1000, length=13000), "at", sep="_")

   # here, hash.alg=0L corresponds to the hashing function used by R's
   # environments.  I know, a name would be better.
   > summary(as.integer(table(hashCodes(keys=keys2, hash.alg=0L, size=2^14))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    800    1100    1500    1625    2025    2700 
   # hash.alg=1L is djb2 from here: http://www.cse.yorku.ca/~oz/hash.html 
   > summary(as.integer(table(hashCodes(keys=keys2, hash.alg=1L, size=2^14))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   2.000   1.648   2.000   4.000 

  # and this is what we see with an environment:
    > e = new.env(hash=T, size=2^14)
    > for (k in keys2) e[[k]] = k
    > summary(env.profile(e)$counts)
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
       0.0000    0.0000    0.0000    0.7935    0.0000 2700.0000 

-- 
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org