[R] tm package: problem of TermDocumentMatrix and minWordLength

Baoqiang Cao bqcaomail at gmail.com
Wed May 16 16:14:12 CEST 2012


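Try this. In your tm version the control option is wordLengths (a
minimum and a maximum word length), not minWordLength: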

dtm <- DocumentTermMatrix(examplecorpus, control = list(wordLengths = c(1, 100)))
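
To see why the bounds matter, here is a rough base-R sketch of a word-length filter. It is not tm's implementation: count_terms() is a hypothetical helper, the whitespace tokenizer is a simplification, and the default of c(3, Inf) is my understanding of tm's wordLengths default, so treat it as an assumption.

```r
# Hypothetical helper (not part of tm): build a small document-term count
# matrix, keeping only tokens whose length lies within word_lengths.
count_terms <- function(docs, word_lengths = c(3, Inf)) {
  # Split each document on whitespace; tm's own tokenizer is more elaborate.
  tokens <- strsplit(docs, "[[:space:]]+")
  # Drop tokens shorter than word_lengths[1] or longer than word_lengths[2].
  tokens <- lapply(tokens, function(tok) {
    n <- nchar(tok)
    tok[n >= word_lengths[1] & n <= word_lengths[2]]
  })
  terms <- sort(unique(unlist(tokens)))
  # One row per document, one column per surviving term.
  counts <- do.call(rbind, lapply(tokens, function(tok) {
    as.integer(table(factor(tok, levels = terms)))
  }))
  dimnames(counts) <- list(Docs = as.character(seq_along(docs)), Terms = terms)
  counts
}

exampledoc <- c("R is good", "R is really good")
short_dropped <- count_terms(exampledoc)                         # lower bound of 3
short_kept <- count_terms(exampledoc, word_lengths = c(1, 100))  # lower bound of 1
```

With the lower bound of 3, only "good" and "really" survive, which matches the matrix in the original post. With c(1, 100), "R" and "is" are kept as well.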



On Wed, May 16, 2012 at 6:22 AM, C.H. <chainsawtiney at gmail.com> wrote:
> Dear All,
>
> The following code illustrates the problem.
>
> [R code]
> require(tm)
> exampledoc <- c("R is good", "R is really good")
> examplecorpus <- Corpus(VectorSource(exampledoc), encoding = "UTF-8")
> dtm <- DocumentTermMatrix(examplecorpus, control = list(minWordLength = 1))
> as.matrix(dtm)
> [/R code]
>
> The terms "R" and "is" were not included in the dtm even though the
> control parameter minWordLength was set to 1.
>
>    Terms
> Docs good really
>   1    1      0
>   2    1      1
>
> Can you reproduce this problem?
>
> The following is my sessionInfo
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] tm_0.5-7.1
>
> loaded via a namespace (and not attached):
> [1] compiler_2.15.0 slam_0.1-23     tools_2.15.0
>
> Regards,
>
> CH
