[R] package "tm" fails to remove "the" with remove stopwords

Ingo Feinerer feinerer at logic.at
Sun Nov 15 17:05:36 CET 2009


On Thu, Nov 12, 2009 at 11:29:50AM -0500, Mark Kimpel wrote:
> I am using code that previously worked to remove stopwords using package "tm".

Thanks for reporting. This is a bug in the removeWords() function in
tm version 0.5-1 available from CRAN:

> require(tm)
> myDocument <- c("the rain in Spain", "falls mainly on the plain", "jack and jill ran up the hill", "to fetch a pail of water")
> text.corp <- Corpus(VectorSource(myDocument))
> #########################
> text.corp <- tm_map(text.corp, stripWhitespace)
> text.corp <- tm_map(text.corp, removeNumbers)
> text.corp <- tm_map(text.corp, removePunctuation)
> ## text.corp <- tm_map(text.corp, stemDocument)
> text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english")))
> dtm <- DocumentTermMatrix(text.corp)
> dtm
> dtm.mat <- as.matrix(dtm)
> dtm.mat
> 
> > dtm.mat
>     Terms
> Docs falls fetch hill jack jill mainly pail plain rain ran spain the water
>    1     0     0    0    0    0      0    0     0    1   0     1   1     0
>    2     1     0    0    0    0      1    0     1    0   0     0   0     0
>    3     0     0    1    1    1      0    0     0    0   1     0   0     0
>    4     0     1    0    0    0      0    1     0    0   0     0   0     1

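The function removeWords() fails to remove words that occur at the
beginning or at the end of a line. In your example, "the" is the first
word of document 1, so it survives there, while it is removed correctly
from documents 2 and 3, where it occurs in the middle of the line.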
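
To illustrate this class of bug, here is a minimal sketch in base R. It is
not the tm source code: the helper names remove_words_naive() and
remove_words_bounded() are hypothetical, and the "naive" pattern is only an
assumption about how such a failure can arise (requiring whitespace on both
sides of a word).

```r
# Hypothetical helpers for illustration only; this is NOT the tm source code.
words <- c("the", "in")
alternation <- paste(words, collapse = "|")

# Assumed buggy approach: a word only matches when whitespace surrounds it,
# so a word at the very start or end of the string is never matched.
remove_words_naive <- function(x) {
  gsub(sprintf("[[:space:]](%s)[[:space:]]", alternation), " ", x)
}

# Boundary-aware approach: \b also matches at the start and end of the string.
remove_words_bounded <- function(x) {
  gsub(sprintf("\\b(%s)\\b", alternation), "", x, perl = TRUE)
}

doc <- "the rain in Spain"
naive_result   <- remove_words_naive(doc)    # leading "the" survives
bounded_result <- remove_words_bounded(doc)  # leading "the" is removed
print(naive_result)
print(bounded_result)
```

Until a fixed version is installed, a boundary-aware gsub() of this kind,
applied to the raw text before building the corpus, might serve as a
workaround.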

This bug is fixed in the latest development version on R-Forge, and
the fix will be included in the next CRAN release.

Please see
https://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/pkg/inst/NEWS?root=tm&view=markup
for a list of all bug fixes and changes between each tm version.

Best regards, Ingo Feinerer

-- 
Ingo Feinerer
Vienna University of Technology
http://www.dbai.tuwien.ac.at/staff/feinerer



