[R] Hashing and environments

Sun Nov 7 00:09:32 CET 2010

Some of this can be automated using the CRAN package
hash.
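
For illustration, here is a minimal sketch of what that might look like. It assumes the hash package is installed from CRAN; the words and counts are made up, and the block is guarded so that it does nothing if the package is missing.

```r
# Sketch only: assumes the CRAN package "hash" is installed.
if (requireNamespace("hash", quietly = TRUE)) {
  h <- hash::hash()                  # empty hash object
  h[["the"]] <- 2                    # add entries by key
  h[["person"]] <- 1
  h[["the"]] <- h[["the"]] + 1       # update an existing entry
  print(hash::has.key("the", h))     # is the key present?
  print(sort(hash::keys(h)))         # all stored keys
  print(h[["the"]])                  # look up one value
}
```

The package wraps an environment, so the lookups rely on the same mechanism as the environment-based approach quoted further down.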

Kjetil

On Sat, Nov 6, 2010 at 10:43 PM, William Dunlap <wdunlap at tibco.com> wrote:
> I would make an environment called wfreqsEnv
> whose entry names are your words and whose entry
> values are the information about the words.  I find
> it convenient to use [[ to make it appear to be
> a list (instead of using exists(), assign(), and get()).
> E.g., the following enters 100,000 words sampled from a
> list of 17,576 and records their id numbers and the
> number of times each is found in the sample.
>
>> wfreqsEnv <- new.env(hash=TRUE, parent = emptyenv())
>> words <- do.call("paste", c(list(sep=""), expand.grid(LETTERS,
> +    letters, letters)))
> # length(words) == 17576
>> set.seed(1)
>> samp <- sample(seq_along(words), size=100000, replace=TRUE)
>> system.time(for(i in samp) {
> +    word <- words[i]
> +    if (is.null(wfreqsEnv[[word]])) { # new entry
> +        wfreqsEnv[[word]] <- list(Count=1, EntryNo=i)
> +    } else { # update existing entry
> +        wfreqsEnv[[word]]$Count <- wfreqsEnv[[word]]$Count + 1
> +    }
> +})
>   user  system elapsed
>   2.28    0.00    2.14
> (The time, in seconds, is from an ancient Windows laptop, c. 2002.)
>
> Here is a small check that we are getting what we expect:
>> words[14736]
> [1] "Tuv"
>> wfreqsEnv[["Tuv"]]
> $Count
> [1] 8
>
> $EntryNo
> [1] 14736
>
>> sum(samp==14736)
> [1] 8
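
The same kind of check can be run over every entry at once. The sketch below is self-contained: it rebuilds a much smaller environment with the same loop (the names countsEnv, toyWords and toySamp are illustrative, not from the message above), pulls all the counts out with mget(), and compares them with table().

```r
# Small stand-in for the word-count environment; names are illustrative.
countsEnv <- new.env(hash = TRUE, parent = emptyenv())
toyWords <- c("Tuv", "Abc", "Xyz")
set.seed(1)
toySamp <- sample(seq_along(toyWords), size = 50, replace = TRUE)

for (i in toySamp) {
  w <- toyWords[i]
  if (is.null(countsEnv[[w]])) {           # new entry
    countsEnv[[w]] <- list(Count = 1, EntryNo = i)
  } else {                                 # update existing entry
    countsEnv[[w]]$Count <- countsEnv[[w]]$Count + 1
  }
}

# Extract the Count field of every entry into a named numeric vector.
allCounts <- vapply(mget(ls(countsEnv), envir = countsEnv),
                    function(x) x$Count, numeric(1))

# Independent tally of the same sample for comparison.
expected <- table(toyWords[toySamp])
print(all(allCounts[names(expected)] == as.vector(expected)))
```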
>
> If we do this with a non-hashed environment we get the same
> answers but the elapsed time is now 34.81 seconds instead of
> 2.14.  If you make wfreqsEnv a list instead of an environment
> then that time is 74.12 seconds (and the answers are the same).
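
A rough way to repeat this comparison is sketched below. The helper countWords() is hypothetical (it is not from the message above), and the word list and sample are much smaller so that the block runs quickly. Timings will vary by machine and R version, and at this size the gap may be far smaller than the figures quoted.

```r
# Hypothetical helper: run the counting loop on any store that supports
# [[ and [[<- by name (a hashed or non-hashed environment, or a list).
countWords <- function(store, words, samp) {
  for (i in samp) {
    w <- words[i]
    if (is.null(store[[w]])) {
      store[[w]] <- list(Count = 1, EntryNo = i)
    } else {
      store[[w]]$Count <- store[[w]]$Count + 1
    }
  }
  store
}

words <- as.vector(outer(LETTERS, letters, paste0))  # 676 two-letter words
set.seed(1)
samp <- sample(seq_along(words), size = 5000, replace = TRUE)

tHashed <- system.time(
  hashed <- countWords(new.env(hash = TRUE, parent = emptyenv()), words, samp))
tPlain <- system.time(
  plain <- countWords(new.env(hash = FALSE, parent = emptyenv()), words, samp))
tList <- system.time(
  asList <- countWords(list(), words, samp))

# All three stores should hold identical counts.
seen <- unique(words[samp])
getCount <- function(store) vapply(seen, function(w) store[[w]]$Count, numeric(1))
sameAnswers <- identical(getCount(hashed), getCount(plain)) &&
  identical(getCount(hashed), getCount(asList))

print(sameAnswers)
print(c(hashed = tHashed[["elapsed"]],
        plain  = tPlain[["elapsed"]],
        list   = tList[["elapsed"]]))
```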
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Levy, Roger
>> Sent: Saturday, November 06, 2010 1:39 PM
>> To: r-help at r-project.org
>> Subject: [R] Hashing and environments
>>
>> Hi,
>>
>> I'm trying to write a general-purpose "lexicon" class and
>> associated methods for storing and accessing information
>> about large numbers of specific words (e.g., their
>> frequencies in different genres).  Crucial to making such a
>> class practically useful is to get hashing working correctly
>> so that information about specific words can be accessed
>> quickly.  But I've never really understood very well how
>> hashing works, so I'm having trouble.
>>
>> Here is an example of what I've done so far:
>>
>> ***
>>
>> setClass("Lexicon",representation(e="environment"))
>> setMethod("initialize","Lexicon",function(.Object,wfreqs) {
>>       .Object@e <- new.env(hash=TRUE,parent=emptyenv())
>>       assign("wfreqs",wfreqs,envir=.Object@e)
>>       return(.Object)
>>       })
>>
>> ## function to access word frequencies
>> wfreq <- function(lexicon,word) {
>>       return(get("wfreqs",envir=lexicon@e)[word])
>> }
>>
>> ## example of use
>> my.lexicon <- new("Lexicon",wfreqs=c("the"=2,"person"=1))
>> wfreq(my.lexicon,"the")
>>
>> ***
>>
>> However, testing indicates that the way I have set this up
>> does not achieve the intended benefits of having the
>> environment hashed:
>>
>> ***
>>
>> sample.wfreqs <- trunc(runif(1e5,max=100))
>> names(sample.wfreqs) <- as.character(1:length(sample.wfreqs))
>> lex <- new("Lexicon",wfreqs=sample.wfreqs)
>> words.to.lookup <- trunc(runif(100,min=1,max=1e5))
>> ## look up the words directly from the sample.wfreqs vector
>> system.time({
>>       for(i in words.to.lookup)
>>               sample.wfreqs[as.character(i)]
>>       },gcFirst=TRUE)
>> ## look up the words through the wfreq() function; time approx the same
>> system.time({
>>       for(i in words.to.lookup)
>>               wfreq(lex,as.character(i))
>>       },gcFirst=TRUE)
>>
>> ***
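
One likely reason the two timings match is that the environment here holds a single object, wfreqs, so its hash table is consulted only once, to find that vector; the per-word lookup is still ordinary name indexing on a long vector. A minimal sketch of the alternative is to give each word its own binding in the environment. The class and function names below (WordLexicon, newWordLexicon, wordFreq) are illustrative, not part of the code above.

```r
library(methods)

# Illustrative class: same shape as Lexicon, but one binding per word.
setClass("WordLexicon", representation(e = "environment"))

newWordLexicon <- function(wfreqs) {
  e <- new.env(hash = TRUE, parent = emptyenv())
  # Each name in wfreqs becomes its own entry in the hashed environment.
  list2env(as.list(wfreqs), envir = e)
  new("WordLexicon", e = e)
}

# Look up one word; return NA for words that were never stored.
wordFreq <- function(lexicon, word) {
  if (exists(word, envir = lexicon@e, inherits = FALSE)) {
    get(word, envir = lexicon@e, inherits = FALSE)
  } else {
    NA_real_
  }
}

lex2 <- newWordLexicon(c(the = 2, person = 1))
print(wordFreq(lex2, "the"))
print(wordFreq(lex2, "absent"))
```

With this layout the word itself is the key that gets hashed, which is what the environment's hash table is designed to speed up.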
>>
>> I'm guessing that the problem is that the indexing of the
>> wfreqs vector in my wfreq() function is not happening inside
>> the actual lexicon's environment.  However, I have not been
>> able to figure out the proper call to get the lookup to
>> happen inside the lexicon's environment.  I've tried
>>
>> wfreq1 <- function(lexicon,word) {
>>       return(eval(wfreqs[word],envir=lexicon@e))
>> }
>>
>> which I'd thought should work, but this gives me an error:
>>
>> > wfreq1(my.lexicon,'the')
>> Error in eval(wfreqs[word], envir = lexicon@e) :
>>   object 'wfreqs' not found
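
The error arises because eval() is an ordinary function: its first argument, wfreqs[word], is evaluated in the calling frame before eval() ever looks at envir. The sketch below (with made-up object names) shows that behaviour, then shows that quoting the expression is still not enough when the environment's parent is emptyenv(), because the `[` function itself cannot be found from there.

```r
# Illustrative data; deliberately not named "wfreqs" in the calling frame.
freqVec <- c(the = 2, person = 1)

eEmpty <- new.env(hash = TRUE, parent = emptyenv())
assign("wfreqs", freqVec, envir = eEmpty)

# 1. Unquoted: wfreqs["the"] is evaluated here, where no wfreqs exists.
r1 <- tryCatch(eval(wfreqs["the"], envir = eEmpty),
               error = function(err) conditionMessage(err))

# 2. Quoted: evaluation now happens in eEmpty, but `[` is not visible
#    from an environment whose parent is emptyenv().
r2 <- tryCatch(eval(quote(wfreqs["the"]), envir = eEmpty),
               error = function(err) conditionMessage(err))

# 3. Quoted, with baseenv() as parent: the lookup succeeds.
eBase <- new.env(hash = TRUE, parent = baseenv())
assign("wfreqs", freqVec, envir = eBase)
r3 <- eval(quote(wfreqs["the"]), envir = eBase)

print(r1)
print(r2)
print(r3)
```

Even the working third form only changes where the expression is evaluated; the lookup is still name indexing on the vector, so it does not by itself make use of the environment's hashing.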
>>
>> Any advice would be much appreciated!
>>
>> Best & many thanks in advance,
>>
>> Roger
>>
>> --
>>
>> Roger Levy                      Email: rlevy at ucsd.edu
>> Assistant Professor             Phone: 858-534-7219
>> Department of Linguistics       Fax:   858-534-4789
>> UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>