[R] findFreqTerms vs minDocFreq in Package 'tm'

vioravis vioravis at gmail.com
Mon Sep 12 08:28:28 CEST 2011


I am using the 'tm' package for text mining and am having trouble finding
the frequently occurring terms. From their definitions, findFreqTerms and
minDocFreq appear to be equivalent: both seem to identify the terms that
appear more often than a specified threshold. However, I am getting
drastically different results from the two. The results from both are given
below:

findFreqTerms identifies 3140 terms that appear at least 5 times, but
minDocFreq leaves only 659 terms. Can someone please explain the reason for
the difference, or whether I have misunderstood their definitions?


> tdm1 <- TermDocumentMatrix(tr1, control = list(weighting = weightBin))
> freq_terms <- findFreqTerms(tdm1, lowfreq = 5, highfreq = Inf)
> str(freq_terms)
 chr [1:3140] "abc" "abil" "abl" "abnorm" "abort" "absenc" ...


> tdm2 <- TermDocumentMatrix(tr1, control = list(minDocFreq = 5, minWordLength = 1))
> str(tdm2)
List of 6
 $ i       : int [1:4703] 173 616 624 241 350 534 563 609 129 333 ...
 $ j       : int [1:4703] 1 2 3 7 7 7 7 8 10 10 ...
 $ v       : num [1:4703] 7 5 6 9 5 7 5 5 5 7 ...
 $ nrow    : int 659
 $ ncol    : int 5677
 $ dimnames:List of 2
  ..$ Terms: chr [1:659] "\024" "\026" "ac" "access" ...
  ..$ Docs : chr [1:5677] "1" "2" "3" "4" ...
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
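
To make the comparison concrete, here is a minimal base-R sketch of two
different counts that the two calls above might be computing. It does not use
'tm' at all. The matrix `m`, its term names, and its values are made up for
illustration. The sketch assumes that findFreqTerms thresholds the row sums of
the matrix it is given, and that minDocFreq is applied inside each document
separately, as the termFreq help page seems to describe. Both readings are
assumptions, not confirmed behaviour.

```r
# Hypothetical toy term-document matrix (rows = terms, columns = documents).
# The values are invented for illustration; they are not from the tr1 corpus.
m <- matrix(
  c(1, 1, 1, 1, 1, 0,   # "alpha": once each in 5 documents
    7, 0, 0, 0, 0, 0,   # "beta":  7 times in a single document
    2, 0, 0, 0, 0, 1),  # "gamma": rare everywhere
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("alpha", "beta", "gamma"),
                  Docs  = as.character(1:6))
)
threshold <- 5

# Reading 1 (assumed for weightBin + findFreqTerms): convert counts to 0/1,
# then keep terms whose row sum, i.e. number of documents, is >= threshold.
binary <- (m > 0) * 1
by_row_sum <- names(which(rowSums(binary) >= threshold))

# Reading 2 (assumed for minDocFreq): within each document, drop a term that
# occurs fewer than `threshold` times, then keep terms that survive anywhere.
local <- m
local[local < threshold] <- 0
by_doc_count <- names(which(rowSums(local) > 0))

print(by_row_sum)    # "alpha"
print(by_doc_count)  # "beta"
```

Under these assumptions the two thresholds select different terms, and the
second reading would also be consistent with every value of `v` shown in
str(tdm2) being 5 or more.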


Thank you.

Ravi


