[R] Looking for Feature Hashing

Wush Wu wush978 at gmail.com
Sat Oct 25 11:25:15 CEST 2014


Dear all,

Sorry, I am not sure whether I should ask this question here or on
R-devel. Is there an existing package that implements, or is in the process
of implementing, feature hashing or a similar function?

For those who do not know "feature hashing", please let me give a brief
explanation here.

Feature hashing is a technique for quickly converting a large number of
strings to dummy variables (similar to `stats::contrasts`). For example, if
I want to convert a character vector `x <- c("asdfa", "adsfausd", .....)` to
dummy variables, I need to construct a mapping between each string and an
index (`base::factor`). However, if `x` is huge and has many distinct
elements, the overhead of constructing the index is large. The overhead is
even larger in a distributed environment.
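
To make the index-construction step concrete, here is a minimal sketch in
base R. The three-element vector is a made-up stand-in for `x`, and
`model.matrix` is only one possible way of expanding the factor into dummy
columns.

```r
# Hypothetical small input; in practice x would be far larger.
x <- c("asdfa", "adsfausd", "asdfa")

# factor() scans x to collect the distinct strings (the levels)
# and maps every element to the integer index of its level.
f <- factor(x)
idx <- as.integer(f)

# Expand the factor into one 0/1 column per level.
dummy <- model.matrix(~ f - 1)
```

The level table has to be built from all of `x` before any row can be
encoded, which is the overhead described above.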

A good hash function can map each string to an index quickly, without the
overhead of constructing the mapping. The probability of a "collision" (two
different strings mapped to the same index) should be small if we pick a
good hash function. For details, please see
http://en.wikipedia.org/wiki/Feature_hashing
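
As a rough illustration of the idea, the sketch below hashes each string
directly to a column index in base R. The names `hash_index` and
`hashed_dummies`, the multiplier 31, and the choice of 16 buckets are all
assumptions made for this example; the toy polynomial hash is not what a
real implementation would use.

```r
# Toy polynomial hash (an assumption for illustration, not a production
# hash): maps one string to an index in 1..n_buckets.
hash_index <- function(s, n_buckets) {
  h <- 0
  for (code in utf8ToInt(s)) {
    # Reducing modulo n_buckets at every step keeps h small and exact.
    h <- (h * 31 + code) %% n_buckets
  }
  h + 1
}

# Build a dense 0/1 matrix with one row per string and n_buckets columns.
hashed_dummies <- function(x, n_buckets) {
  idx <- vapply(x, hash_index, numeric(1), n_buckets = n_buckets,
                USE.NAMES = FALSE)
  m <- matrix(0, nrow = length(x), ncol = n_buckets)
  m[cbind(seq_along(x), idx)] <- 1
  m
}

# Hypothetical small input.
x <- c("asdfa", "adsfausd", "asdfa")
m <- hashed_dummies(x, n_buckets = 16)
```

No table of levels is built, so each element of `x` can be encoded
independently of the others. Strings that collide simply share a column.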

Best,
Wush Wu
PhD Student Graduate Institute of Electrical Engineering, National Taiwan
University

More information about the R-help mailing list