[R] Analyzing texts with tm

Wed Jan 19 11:55:54 CET 2011

Hey everybody!

I have to use R's tm package to do some text analysis; the first step is to create a term frequency matrix.
Digging through tm's source code, it seems to use logic like this to compute term frequencies:

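library(tm)
data("crude")
(txt <- Content(crude[[1]]))
# Replace every run of non-alphanumeric characters with a space, then split on spaces
(tokTxt <- unlist(strsplit(gsub("[^[:alnum:]]+", " ", txt), " ", fixed = TRUE)))
# Single words can be counted ...
table(factor(tokTxt, levels = c('two')))
# ... but a phrase is always counted as 0, because no token can contain a space
table(factor(tokTxt, levels = c('two days')))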

As this example demonstrates, tokenizing the input text makes it impossible to count a group of words separated by whitespace (such as "two days") as a single term.

So my question is: How would you create a term frequency matrix in R that can also count such multi-word terms?

Here's some Ruby code I once wrote to show what I want:
txt = "some text containing two days\n"
# \s instead of a literal space, so that "two days" followed by "\n" is counted too
freq = ['two', 'two days'].inject({}) { |h, w| h[w] = txt.scan(/\s#{Regexp.escape(w)}\s/).length; h }
(Reads as: "Given txt, build an associative array mapping each word to its frequency in txt. To count occurrences, do not split the text at whitespace; instead, use a regular expression to search txt for the word or group of words surrounded by whitespace.")
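
To make the target a bit more concrete, here is a rough Ruby sketch that extends the same idea to several documents, giving one row per document and one column per term. The helper name term_frequencies and the second sample document are made up for illustration. It also assumes that a term at the very start or end of the text should count, so it checks for "no non-whitespace character next to the term" rather than requiring whitespace on both sides.

```ruby
# Hypothetical helper: count how often each term (a single word or a phrase)
# occurs in text as a whole, whitespace-delimited unit.
def term_frequencies(text, terms)
  terms.each_with_object({}) do |term, freq|
    # The lookarounds do not consume the delimiters, so two occurrences
    # separated by a single space are both counted.
    pattern = /(?<!\S)#{Regexp.escape(term)}(?!\S)/
    freq[term] = text.scan(pattern).length
  end
end

# Made-up sample documents, for illustration only.
docs  = ["some text containing two days\n", "two days and two nights"]
terms = ['two', 'two days']

# One row per document, one column per term (same order as terms).
matrix = docs.map { |doc| term_frequencies(doc, terms).values_at(*terms) }
p matrix
```

What I am looking for is the R (ideally tm) equivalent of matrix above.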

Thanks in advance for any input!