[R] Removing words and initials with tm

Bob Green bgreen at dyson.brisnet.org.au
Sat Apr 11 13:04:08 CEST 2015


Hello Sun,

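The order of the tm transformations makes a lot of difference. For 
example, removeWords is case-sensitive, so it matters whether it runs 
before or after the text is converted to lower case.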

It isn't a shortcut, but if you identify all names you could create 
your own Stop words list:

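corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "name1", "name2"))

(Here "name1" and "name2" are placeholders for the names you have 
identified.)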
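
To make the stop word idea and the point about ordering concrete, here is a minimal sketch. The example sentences and the names in my_names are made up for illustration, and it assumes the tm package is installed. It is one possible ordering, not the only sensible one.

```r
library(tm)

# Made-up example documents
docs <- c("Bob moved from York to New York.",
          "J. Smith met Bob in York.")
corpus <- VCorpus(VectorSource(docs))

# Hypothetical list of names to drop; replace with the names in your data
my_names <- c("bob", "smith")
my_stopwords <- c(stopwords("english"), my_names)

# Lower-case first: removeWords is case-sensitive, so "Bob" would
# survive if removeWords ran before this step
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, my_stopwords)

# Tidy up the gaps left behind by the removed words
corpus <- tm_map(corpus, stripWhitespace)

# Inspect the cleaned text
out <- sapply(corpus, as.character)
print(out)
```

Note that "york" is left alone here, because it is not in my_names.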

In the case of York, Key Word in Context (KWIC) output could be used 
to check how certain words are used. You could identify the word 
usages you want to remove or retain, and rename the relevant 
instances accordingly.
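
As a rough illustration of the renaming idea, the sketch below protects one usage of York by joining it into a single token and then removes the remaining instances. The example sentences and the choice of "New_York" as the protected form are my own assumptions. It uses only base R.

```r
# Made-up example texts
texts <- c("He moved to New York.", "Mr York spoke.")

# Protect the usage to retain by renaming it to a single token
texts <- gsub("\\bNew York\\b", "New_York", texts)

# Remove the remaining stand-alone "York"; "New_York" is not matched
# because "_" counts as a word character, so there is no word boundary
texts <- gsub("\\bYork\\b", "", texts)

# Collapse any doubled spaces left by the removal
texts <- gsub(" +", " ", texts)
print(texts)
```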

This is labour intensive, but Gries, in his Quantitative Corpus 
Linguistics, notes that time spent trying to refine code might 
sometimes be better spent on manual analysis (p. 164). This book 
includes a KWIC-type function (p. 127), but I haven't been able to 
work out how to modify it to read more than six words either side of 
the specified word. Six should be adequate for your purpose. Jockers' 
book also includes a KWIC function, but I don't believe it searches 
the entire corpus, only a specified text.

I recently checked and tm doesn't have a KWIC function, but for the 
R-talented (which excludes me) it might be possible to write one. For 
example, Jim Holtman once wrote a KWIC function to identify word use 
in a csv file.
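
For what it is worth, a bare-bones KWIC function can be sketched in base R. This is not the function from Gries, Jockers or Jim Holtman. The name kwic, its arguments and the crude tokeniser are all my own assumptions. It searches every text in a character vector and lets you set the window size.

```r
# Minimal KWIC sketch: returns one row per hit, with up to `window`
# words of context on either side of the keyword
kwic <- function(texts, keyword, window = 6) {
  keyword <- tolower(keyword)
  hits <- list()
  for (i in seq_along(texts)) {
    # Crude tokeniser: lower-case, then split on anything that is not
    # an ASCII letter, digit or apostrophe (accented letters are lost)
    tokens <- unlist(strsplit(tolower(texts[i]), "[^a-z0-9']+"))
    tokens <- tokens[tokens != ""]
    for (p in which(tokens == keyword)) {
      left  <- tail(tokens[seq_len(p - 1)], window)
      right <- head(tokens[-seq_len(p)], window)
      hits[[length(hits) + 1]] <- data.frame(
        doc     = i,
        left    = paste(left, collapse = " "),
        keyword = tokens[p],
        right   = paste(right, collapse = " "),
        stringsAsFactors = FALSE
      )
    }
  }
  if (length(hits) == 0) {
    # No matches: return an empty data frame with the same columns
    return(data.frame(doc = integer(0), left = character(0),
                      keyword = character(0), right = character(0),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, hits)
}

# Made-up example texts
docs <- c("The Duke of York visited the city of York last week.",
          "She flew to New York.")
res <- kwic(docs, "york", window = 3)
print(res)
```

To run it over a tm corpus, something like kwic(sapply(corpus, as.character), "york") should work, since the function only needs a character vector.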

Bob


