[R] Help: stemming and stem completion with package tm in R

Felix Andrews felix at nfrac.org
Mon Nov 7 13:38:24 CET 2011


Hi Yanchang,

The problem seems to be that stemCompletion only looks for words that
begin with "mine", and "mining" does not strictly begin with "mine". I
don't think there is any easy way to modify stemCompletion to get
around that.
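
To illustrate the point, here is a rough base-R sketch of that kind of prefix matching. It is only an approximation under the assumption that completion works as described above, not tm's actual code; the variable names (dictionary, stem, candidates) are made up for the illustration.

```r
# Words from the original example, and the stem reported for "mining".
dictionary <- c("mining", "miners", "mining")
stem <- "mine"

# Keep only the dictionary words that literally start with the stem.
candidates <- dictionary[startsWith(dictionary, stem)]
print(candidates)

# "mining" is never a candidate, however frequent it is in the text,
# because it does not begin with the four characters "mine".
print("mining" %in% candidates)
```

So the frequency of "mining" never comes into play: it is filtered out before the most prevalent match is chosen.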

However, maybe you could replace each stemmed word with the most
prevalent word in your document that shares its stem; then you would
not need to use stemCompletion at all, e.g.

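# Most frequent word in x (a and b are defined in the quoted example below)
topfreq <- function(x) rev(names(sort(table(x))))[1]
# Replace each word in a by the most frequent word sharing its stem in b
(d <- ave(a, b, FUN = topfreq))
# [1] "mining" "miners" "mining"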
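
If you need to complete stems that come from somewhere other than the original vector, the same idea can be turned into a lookup table. This is only a sketch: the stems are hardcoded so that it runs without tm ("miner" as the stem of "miners" is an assumption), and lookup is just a name chosen for the illustration.

```r
# Words from the original example.
a <- c("mining", "miners", "mining")
# Stems hardcoded so the sketch runs without tm; "miner" is an assumed stem.
b <- c("mine", "miner", "mine")

# Most frequent word in a group (same helper as above).
topfreq <- function(x) rev(names(sort(table(x))))[1]

# Named vector mapping each stem to its most frequent original word.
lookup <- vapply(split(a, b), topfreq, character(1))

# Complete any vector of stems by name lookup.
d <- unname(lookup[b])
print(d)
```

A stem that does not occur in b would come back as NA, so that case would need separate handling.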

Cheers
Felix

On 4 November 2011 12:28, Yanchang Zhao <yanchangzhao at gmail.com> wrote:
> Hi All
>
> I came across the problem below when doing stemming and stem completion
> with package tm in R. The word "mining" was stemmed to "mine" with
> stemDocument(), and then completed to "miners" with stemCompletion().
> However, I prefer to keep "mining" intact.
>
> For stemCompletion(), the default type of completion is "prevalent",
> which takes the most frequent match as completion. Although "mining"
> is much more frequent than "miners" in my text, it still completed
> "mine" to "miners".
>
> An example is shown below.
>
> ############################################
> library(tm)
> (a <- c("mining", "miners", "mining"))
> (b <- stemDocument(a))
> (d <- stemCompletion(b, dictionary=a))
> ############################################
>
> Some possible solutions are:
> 1) to change the options or dictionary in stemDocument(), so that
> "mining" is not stemmed to "mine", which I think is the best way;
> 2) to change the options or dictionary in stemCompletion(), so that
> "mine" is completed to "mining";
> 3) to manually correct this after stem completion, which is the last
> option.
>
> I am looking for a solution along the lines of 1) or 2) above, but
> cannot find a way to do it with stemDocument() in package tm.
>
> Any help will be appreciated.
>
> Thanks,
> Yanchang Zhao
> Email: yanchangzhao(at)gmail.com
>
> RDataMining:           http://www.rdatamining.com
> Twitter:               http://twitter.com/RDataMining
> Group on Linkedin:   http://group2.rdatamining.com



-- 
Felix Andrews / 安福立
http://www.neurofractal.org/felix/


