[R] Looking for Feature Hashing

Duncan Murdoch murdoch.duncan at gmail.com
Sun Oct 26 11:43:14 CET 2014


On 25/10/2014, 5:25 AM, Wush Wu wrote:
> Dear all,
> 
> Sorry, I am not sure whether I should ask this question here or on
> R-devel. Is there an existing package that implements, or is in the process
> of implementing, feature hashing or a similar function?
> 
> For those who do not know "feature hashing", let me give a brief
> explanation here.
> 
> Feature hashing is a technique for quickly converting a large number of
> strings to dummy variables (similar to `stats::contrasts`). For example, if
> I want to convert a character vector `x <- c("asdfa", "adsfausd", .....)`
> to dummy variables, I need to construct a mapping between the strings and
> the indices (`base::factor`). However, if `x` has many distinct elements
> and `x` is huge, the overhead of constructing the index is large.
> Moreover, the overhead is even larger in a distributed environment.
> 
> A good hashing function can map each string to an index quickly, without
> the overhead of constructing the index. The probability of a "collision"
> is small if we pick a good hashing function. For details, please see
> http://en.wikipedia.org/wiki/Feature_hashing
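
To make the quoted description concrete, here is a minimal base-R sketch of
the idea. The helper names `hash_bucket` and `hashed_dummies`, the bucket
count of 16, and the simple polynomial hash are all assumptions made for
illustration; they are not taken from any package, and a real application
would likely want a stronger hash function and a sparse matrix.

```r
# Hypothetical helper: a simple polynomial string hash (illustrative only,
# not a production-quality hash). Maps each string to a bucket in 1..n_buckets.
hash_bucket <- function(strings, n_buckets = 16L) {
  vapply(strings, function(s) {
    h <- 0
    for (code in utf8ToInt(s)) {
      # Reduce modulo n_buckets at every step so h never grows large.
      h <- (h * 31 + code) %% n_buckets
    }
    as.integer(h) + 1L
  }, integer(1), USE.NAMES = FALSE)
}

# Hypothetical helper: build a dummy-variable matrix with one row per element
# of x and n_buckets columns, without constructing the set of distinct levels.
# NA inputs are not handled in this sketch.
hashed_dummies <- function(x, n_buckets = 16L) {
  idx <- hash_bucket(x, n_buckets)
  m <- matrix(0L, nrow = length(x), ncol = n_buckets)
  m[cbind(seq_along(x), idx)] <- 1L
  m
}

x <- c("asdfa", "adsfausd", "asdfa")
m <- hashed_dummies(x, n_buckets = 16L)
```

Equal strings always land in the same column, so rows 1 and 3 of `m` are
identical. Distinct strings may also share a column (a collision); increasing
`n_buckets` makes that less likely.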

The "digest" package implements several different hash functions.  You
could use the hash values as names in an environment to index arbitrary
objects associated with the values.
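
A rough sketch of that suggestion, assuming the digest package is installed.
The helper names (`key_for`, `put`, `get_value`, `bucket_for`), the choice of
MD5, and the bucket count of 1024 are assumptions for illustration only.

```r
library(digest)

# An environment used as a hash table, keyed by digest strings.
store <- new.env(hash = TRUE)

# Hypothetical helper: hash the raw string (no serialization) to a hex key.
key_for <- function(s) digest(s, algo = "md5", serialize = FALSE)

# Hypothetical helpers: store and retrieve an object under the string's hash.
put <- function(s, value) assign(key_for(s), value, envir = store)
get_value <- function(s) get(key_for(s), envir = store, inherits = FALSE)

put("asdfa", 1L)
put("adsfausd", 2L)
get_value("adsfausd")

# Hypothetical helper: reduce the hash directly to a column index in
# 1..n_buckets. Seven hex digits always fit in an R integer.
bucket_for <- function(s, n_buckets = 1024L) {
  strtoi(substr(key_for(s), 1L, 7L), base = 16L) %% n_buckets + 1L
}

bucket_for("asdfa")
```

Note that the environment still holds one entry per distinct string, whereas
`bucket_for` needs no stored mapping at all, which is closer to feature
hashing as described in the question.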

Duncan Murdoch


