[R] Hashing and environments

William Dunlap wdunlap at tibco.com
Sat Nov 6 23:43:36 CET 2010


I would make an environment called wfreqsEnv
whose entry names are your words and whose entry
values are the information about the words.  I find
it convenient to use [[ to make it appear to be
a list (instead of using exists(), assign(), and get()).
E.g., the following enters 100,000 words sampled (with
replacement) from a list of 17,576 and records their id numbers
and the number of times each is found in the sample.

> wfreqsEnv <- new.env(hash=TRUE, parent = emptyenv())
> words <- do.call("paste", c(list(sep=""), expand.grid(LETTERS,
+    letters, letters)))
# length(words) == 17576
> set.seed(1)
> samp <- sample(seq_along(words), size=100000, replace=TRUE)
> system.time(for(i in samp) {
+    word <- words[i]
+    if (is.null(wfreqsEnv[[word]])) { # new entry
+        wfreqsEnv[[word]] <- list(Count=1, EntryNo=i)
+    } else { # update existing entry
+        wfreqsEnv[[word]]$Count <- wfreqsEnv[[word]]$Count + 1
+    }
+})
   user  system elapsed 
   2.28    0.00    2.14 
(The time, in seconds, is from an ancient Windows laptop, c. 2002.)
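
For comparison, here is a minimal sketch of the same kind of update written both ways: with [[ and with exists(), assign(), and get(). The names env and "Abc" are made up for illustration. One caveat with the is.null() test: an entry stored via assign() with the value NULL would look the same as a missing entry.

```r
# Illustrative names only: env, "Tuv" and "Abc" are not part of the timing run.
env <- new.env(hash = TRUE, parent = emptyenv())

# List-style access: a missing key yields NULL rather than an error.
is.null(env[["Tuv"]])
env[["Tuv"]] <- list(Count = 1, EntryNo = 14736)

# The same insert-or-update logic spelled out with exists(), assign(), get().
# inherits = FALSE restricts the lookup to env itself.
if (!exists("Abc", envir = env, inherits = FALSE)) {
    assign("Abc", list(Count = 1, EntryNo = 1), envir = env)
}
entry <- get("Abc", envir = env, inherits = FALSE)
entry$Count <- entry$Count + 1
assign("Abc", entry, envir = env)

# Both styles read and write the same underlying bindings.
env[["Abc"]]$Count
```

The [[ form keeps the loop body short, which is the main reason to prefer it here.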

Here is a small check that we are getting what we expect:
> words[14736]
[1] "Tuv"
> wfreqsEnv[["Tuv"]]
$Count
[1] 8

$EntryNo
[1] 14736

> sum(samp==14736)
[1] 8
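
The spot check can be extended to every word. The sketch below rebuilds the table so that it runs on its own, then compares all stored counts against an independent tally from table(). The names keys, counts, ids, and tally are introduced here for illustration. Individual counts may differ from those shown above on newer versions of R, since the default sampling algorithm behind sample() has changed since 2010; the assertions do not depend on the particular sample.

```r
# Rebuild the table from the session above so this block is self-contained.
wfreqsEnv <- new.env(hash = TRUE, parent = emptyenv())
words <- do.call("paste", c(list(sep = ""), expand.grid(LETTERS, letters, letters)))
set.seed(1)
samp <- sample(seq_along(words), size = 100000, replace = TRUE)
for (i in samp) {
    word <- words[i]
    if (is.null(wfreqsEnv[[word]])) {   # new entry
        wfreqsEnv[[word]] <- list(Count = 1, EntryNo = i)
    } else {                            # update existing entry
        wfreqsEnv[[word]]$Count <- wfreqsEnv[[word]]$Count + 1
    }
}

# Pull every stored count and id back out, named by word.
keys <- ls(wfreqsEnv)
counts <- vapply(keys, function(k) wfreqsEnv[[k]]$Count, numeric(1))
ids <- vapply(keys, function(k) wfreqsEnv[[k]]$EntryNo, numeric(1))

# Compare against an independent tally of the sampled words.
tally <- table(words[samp])
stopifnot(length(keys) == length(tally))
stopifnot(all(counts[names(tally)] == as.vector(tally)))
stopifnot(all(words[ids] == keys))
```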

If we do this with a non-hashed environment we get the same
answers but the elapsed time is now 34.81 seconds instead of
2.14.  If you make wfreqsEnv a list instead of an environment
then that time is 74.12 seconds (and the answers are the same).
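
A rough way to reproduce the three-way comparison is sketched below. The helper count_words() is a made-up name, and the sample is cut to 20,000 draws (an arbitrary choice) so that the list version finishes quickly. Since [[ behaves alike for lists and environments in this loop, one function can serve all three containers; it returns the container because a list, unlike an environment, is not modified in place. Timings will vary by machine and R version, so none are shown.

```r
# Hypothetical helper: runs the counting loop on any container that
# supports [[ and [[<- by name (an environment or a list).
count_words <- function(store, words, samp) {
    for (i in samp) {
        word <- words[i]
        if (is.null(store[[word]])) {
            store[[word]] <- list(Count = 1, EntryNo = i)
        } else {
            store[[word]]$Count <- store[[word]]$Count + 1
        }
    }
    store
}

words <- do.call("paste", c(list(sep = ""), expand.grid(LETTERS, letters, letters)))
set.seed(1)
samp <- sample(seq_along(words), size = 20000, replace = TRUE)

t_hashed <- system.time(
    h <- count_words(new.env(hash = TRUE, parent = emptyenv()), words, samp))
t_unhashed <- system.time(
    u <- count_words(new.env(hash = FALSE, parent = emptyenv()), words, samp))
t_list <- system.time(
    l <- count_words(list(), words, samp))

# Elapsed seconds for each container.
elapsed <- c(hashed = t_hashed[["elapsed"]],
             unhashed = t_unhashed[["elapsed"]],
             list = t_list[["elapsed"]])
print(elapsed)

# All three containers should hold identical entries.
k <- ls(h)
same <- vapply(k, function(w) identical(h[[w]], u[[w]]) &&
                              identical(h[[w]], l[[w]]), logical(1))
stopifnot(setequal(k, names(l)), all(same))
```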

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Levy, Roger
> Sent: Saturday, November 06, 2010 1:39 PM
> To: r-help at r-project.org
> Subject: [R] Hashing and environments
> 
> Hi,
> 
> I'm trying to write a general-purpose "lexicon" class and 
> associated methods for storing and accessing information 
> about large numbers of specific words (e.g., their 
> frequencies in different genres).  Crucial to making such a 
> class practically useful is to get hashing working correctly 
> so that information about specific words can be accessed 
> quickly.  But I've never really understood very well how 
> hashing works, so I'm having trouble.
> 
> Here is an example of what I've done so far:
> 
> ***
> 
> setClass("Lexicon",representation(e="environment"))
> setMethod("initialize","Lexicon",function(.Object,wfreqs) {
> 	.Object@e <- new.env(hash=T,parent=emptyenv())
> 	assign("wfreqs",wfreqs,envir=.Object@e)
> 	return(.Object)
> 	})
> 
> ## function to access word frequencies
> wfreq <- function(lexicon,word) {
> 	return(get("wfreqs",envir=lexicon@e)[word])
> }
> 
> ## example of use
> my.lexicon <- new("Lexicon",wfreqs=c("the"=2,"person"=1))
> wfreq(my.lexicon,"the")
> 
> ***
> 
> However, testing indicates that the way I have set this up 
> does not achieve the intended benefits of having the 
> environment hashed:
> 
> ***
> 
> sample.wfreqs <- trunc(runif(1e5,max=100))
> names(sample.wfreqs) <- as.character(1:length(sample.wfreqs))
> lex <- new("Lexicon",wfreqs=sample.wfreqs)
> words.to.lookup <- trunc(runif(100,min=1,max=1e5))
> ## look up the words directly from the sample.wfreqs vector
> system.time({
> 	for(i in words.to.lookup)
> 		sample.wfreqs[as.character(i)]
> 	},gcFirst=TRUE)
> ## look up the words through the wfreq() function; time approx the same
> system.time({
> 	for(i in words.to.lookup)
> 		wfreq(lex,as.character(i))
> 	},gcFirst=TRUE)
> 
> ***
> 
> I'm guessing that the problem is that the indexing of the 
> wfreqs vector in my wfreq() function is not happening inside 
> the actual lexicon's environment.  However, I have not been 
> able to figure out the proper call to get the lookup to 
> happen inside the lexicon's environment.  I've tried
> 
> wfreq1 <- function(lexicon,word) {
> 	return(eval(wfreqs[word],envir=lexicon@e))
> }
> 
> which I'd thought should work, but this gives me an error:
> 
> > wfreq1(my.lexicon,'the')
> Error in eval(wfreqs[word], envir = lexicon@e) : 
>   object 'wfreqs' not found
> 
> Any advice would be much appreciated!
> 
> Best & many thanks in advance,
> 
> Roger
> 
> --
> 
> Roger Levy                      Email: rlevy at ucsd.edu
> Assistant Professor             Phone: 858-534-7219
> Department of Linguistics       Fax:   858-534-4789
> UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy
> 