[R] question about the Text Mining package tm

onyourmark william108 at gmail.com
Fri Apr 17 08:01:26 CEST 2009


Hello. I am trying to work with the text mining package tm.

I have a directory called textsTweet1 which contains three files:
short.txt
myTextFile.txt
myTextFile.csv

short.txt contains one line: THE CAT IN THE HAT\n

myTextFile.txt contains some tweets from Twitter. Its first few lines are:

@oliviamunn I miss a good Yakaniku...I miss Japan...I NEED COCO EVERYBODY. I
NEED TO GET ON JAPAN TIME NOW. NO SLEEP!!!SAKURA at Niigata, Japan 
http://ff.im/-29ufG19:30 [BS Japan] 絶対可憐チルドレン #50 「一意奮闘!オーバー・ザ・フューチャー」RT@
kvsrinath Japan's New Flat Screens: The Eco-Friendly TV . 
http://is.gd/sIS7 #greenMold99 says: Introduction to Chiropractic and manual
therapeutics when unfit.Choice of schools in Japan, and mo... 
http://i.sitesays.com/lc7Japan Said to Sell 17 Trillion Yen of Extra Bonds -
Bloomberg 

Actually, the original file contained no newlines; I inserted one before
every occurrence of http.

I ran the following code:
library("tm")
my.path <- 'C:\\dataForR\\textsTweet1\\'
my.path.csv<-'C:\\dataForR\\textsTweet1\\myTextFile.csv'
(ovid <- Corpus(DirSource(my.path), readerControl = list(reader = readPlain,
language = "la")))

Response from R:
A text document collection with 3 text documents
Warning message:
In readLines(filename, encoding = encoding) :
  incomplete final line found on 'C:\dataForR\textsTweet1\/short.txt'

Then I ran the TermDocMatrix function. As I understand it, it takes a corpus
and counts the occurrences of each term in each document. Or, as the
documentation says, it "Constructs a term-document matrix".
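
To illustrate the kind of per-line counting meant here, the following is a
minimal base-R sketch. It does not use tm and is not how tm implements
TermDocMatrix; the two example lines and the names lines, tokens, terms and
counts are made up for illustration.

```r
# Hypothetical example data: each element stands for one line of a text file.
lines <- c("THE CAT IN THE HAT",
           "the cat sat")

# Lower-case each line and split it into words on runs of non-letters.
tokens <- lapply(tolower(lines), function(x) {
  words <- strsplit(x, "[^a-z]+")[[1]]
  words[nzchar(words)]  # drop empty strings left by leading separators
})

# The vocabulary is every distinct word seen in any line.
terms <- sort(unique(unlist(tokens)))

# Count each vocabulary term per line; vapply returns terms x lines,
# so transpose to get one row per line and one column per term.
counts <- t(vapply(tokens,
                   function(words) as.integer(table(factor(words, levels = terms))),
                   integer(length(terms))))
dimnames(counts) <- list(line = as.character(seq_along(lines)), term = terms)

print(counts)
```

Here counts has one row per input line and one column per term, which is the
shape of result being asked about for myTextFile.txt.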

> tdm<-TermDocMatrix(ovid)
> Data(tdm)[1:2, 105:107]
2 x 3 sparse Matrix of class "dgCMatrix"
  revealed said sakura
1        .    .      .
2       15   15     15


> Data(tdm)[1:21, 100:105]
Error in intI(i, n = di[1], dn = dn[[1]]) : index larger than maximal 3

I don't understand why I am getting only two rows. As far as I can tell, the
first row is for the short.txt file and the second row seems to be for the
whole myTextFile.txt file.

How can I get TermDocMatrix to output each line of myTextFile.txt as a
separate row?

Thanks very much.