[R] Help with stemDocument

Triss.Ashton triss.ashton at unt.edu
Fri May 18 17:30:27 CEST 2012


Thanks Milan, it is running now.  It seems part of the problem, as you
suggested, was the packages.  Although I had just installed RWeka,
Snowball and the like, they were out of date, so updating them fixed
stemDocument. As for removeWords, that began working once I cut my data in
half.  Apparently there are some memory management issues I have yet to
figure out.  Thanks again for the help.
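
For anyone hitting the same "invalid regular expression" error from
removeWords: one workaround that might help is to apply the stopword list in
smaller batches, so that each pattern passed to gsub() stays short. The sketch
below uses base R only. The helper name remove_words_chunked and its
chunk_size parameter are made up for illustration and are not part of tm, and
it is an assumption that pattern size is what triggers the error here.

```r
# Hypothetical helper (not part of tm): strip a word list from text in
# batches, so that each regular expression handed to gsub() stays short.
remove_words_chunked <- function(text, words, chunk_size = 100L) {
  # Assign consecutive words to groups of at most `chunk_size` entries.
  groups <- ceiling(seq_along(words) / chunk_size)
  for (chunk in split(words, groups)) {
    # Same pattern shape as in the quoted error message: \b(w1|w2|...)\b
    pattern <- sprintf("\\b(%s)\\b", paste(chunk, collapse = "|"))
    text <- gsub(pattern, "", text, perl = TRUE)
  }
  text
}

# Example with a tiny word list and a deliberately small chunk size.
cleaned <- remove_words_chunked("the cat and the dog", c("the", "and"),
                                chunk_size = 1L)
```

The removed words leave their surrounding spaces behind, so a whitespace
clean-up step (stripWhitespace in the pipeline quoted below) is still needed
afterwards. How such a function has to be wrapped for tm_map depends on the
tm version in use.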

Triss



Milan Bouchet-Valat wrote
> 
> On Thursday 10 May 2012 at 17:12 -0700, Triss.Ashton wrote:
>> Alekseiy, I tried your recommendation with several variations. It still
>> does not run.  I think the problem has to do with R 2.15 and the refreshed
>> tm package.
> It works here with R 2.15.0 and tm 0.5-7.2 (development version), all
> other relevant packages of the same version as you (but on Linux 64
> bits). So it might not be the problem.
> 
> I'm using the docs example as a test:
> data("crude")
> crude[[1]]
> stemDocument(crude[[1]])
> 
>> Everything runs under R2.10 with the following code:
>> 
>> a <- Corpus(VectorSource(df$text)) # create corpus object
>> a <- tm_map(a, removePunctuation)
>> a <- tm_map(a, removeNumbers)
>> a <- tm_map(a, removeWords, stopwords("english"))
>> a <- tm_map(a, stripWhitespace)		
>> a <- tm_map(a, stemDocument, language = "english") 
> Let's focus on the example from the docs, since it's simple. Anyway, your
> example is not reproducible since you do not provide the original data.
> 
>> 
>> This same code run on R 2.15 results in:
>> 1. removeWords working sometimes, and sometimes not.
>> 2. stemDocument not working at all.
>> 
>> Both error out.  removeWords always stops reading in the stopword list on
>> the same line number (I have added and removed words, with no difference).
>> Session info is below:
>> 
>> > a <- tm_map(a, removeWords, stopwords("english"))
>> 
>> Error in gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "",  : 
>>   invalid regular expression
>> '\b(a|about|above|across|after|again|against|all|almost|alone|along|already|also|although|always|am|among|an|and|another|any|anybody|anyone|anything|anywhere|are|area|areas|aren't|around|as|ask|asked|asking|asks|at|away|b|back|backed|backing|backs|be|became|because|become|becomes|been|before|began|behind|being|beings|below|best|better|between|big|both|but|by|c|came|can|cannot|can't|case|cases|certain|certainly|clear|clearly|come|could|couldn't|d|did|didn't|differ|different|differently|do|does|doesn't|doing|done|don't|down|downed|downing|downs|during|e|each|early|either|end|ended|ending|ends|enough|even|evenly|ever|every|everybody|everyone|everything|everywhere|f|face|faces|fact|facts|far|felt|few|find|finds|first|for|four|from|full|fully|further|furthered|furthering|furthers|g|gave|general|generally|get|gets|give|given|gives|go|going|good|goods|got|great|greater|greatest|group|grouped|grouping|groups|h|had|hadn't|has|hasn't|have|haven't|having|he|he
>> 
>> 
>> > a <- tm_map(a, stemDocument, language = "english") 
>> Error in .jnew(name) : java.lang.ClassNotFoundException
> This error suggests you should reconfigure Java. Have you tried
> reinstalling rJava, Snowball, RWekajars and RWeka?
> 
>> SessionInfo:
>> 
>> > sessionInfo()
>> R version 2.15.0 (2012-03-30)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>> 
>> locale:
>> [1] LC_COLLATE=English_United States.1252 
>> [2] LC_CTYPE=English_United States.1252   
>> [3] LC_MONETARY=English_United States.1252
>> [4] LC_NUMERIC=C                          
>> [5] LC_TIME=English_United States.1252    
>> 
>> attached base packages:
>> [1] stats4    grid      stats     graphics  grDevices utils     datasets 
>> [8] methods   base     
>> 
>> other attached packages:
>>  [1] topicmodels_0.1-5 slam_0.1-23       modeltools_0.2-19 lasso2_1.2-12    
>>  [5] pvclust_1.2-2     stringr_0.6       plyr_1.7.1        Snowball_0.0-8   
>>  [9] rJava_0.9-3       ggplot2_0.9.0     tm_0.5-7.1        twitteR_0.99.19  
>> [13] rjson_0.2.8       RCurl_1.91-1.1    bitops_1.0-4.1   
>> 
>> loaded via a namespace (and not attached):
>>  [1] colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2       MASS_7.3-17       
>>  [5] memoise_0.1        munsell_0.3        proto_0.3-9.2      RColorBrewer_1.0-5
>>  [9] reshape2_1.2.1     RWeka_0.4-11       RWekajars_3.7.5-1  scales_0.2.0      
>> > 
>> Hi Triss, 
>> 
>> If you need to stem just one text in the Corpus, use
>> a[[n]] <- stemDocument(a[[n]])
>> 
>> Best,
>> -Alex
>> ________________________________________
>> From: r-help-bounces@ [r-help-bounces@] on behalf of Triss.Ashton
>> [triss.ashton@]
>> Sent: 02 May 2012 21:09
>> To: r-help@
>> Subject: Re: [R] Help with stemDocument
>> 
>> I am having a problem with stemDocument also.  I can make it work by moving
>> the data into a Corpus by using:
>> 
>> >  a <- Corpus(VectorSource(df$text)) # create corpus object
>> >  a <- tm_map(a, stemDocument, language = "english")
>> 
>> but it is horribly slow.  I want to stem outside the Corpus object like:
>> 
>> > df$text <- stemDocument(df$text, language = "english")
>> 
>> but it returns the original text.
>> 
>> In fact, using the example in the tm package documentation does not work
>> either:
>> 
>> > data("crude")
>> > crude[[1]]
>> Diamond Shamrock Corp said that
>> effective today it had cut its contract prices for crude oil by
>> 1.50 dlrs a barrel.
>>     The reduction brings its posted price for West Texas
>> Intermediate to 16.00 dlrs a barrel, the copany said.
>>     "The price reduction today was made in the light of falling
>> oil product prices and a weak crude oil market," a company
>> spokeswoman said.
>>     Diamond is the latest in a line of U.S. oil companies that
>> have cut its contract, or posted, prices over the last two days
>> citing weak oil markets.
>>  Reuter
>> > stemDocument(crude[[1]], language = "english") # specify language
>> Diamond Shamrock Corp said that
>> effective today it had cut its contract prices for crude oil by
>> 1.50 dlrs a barrel.
>>     The reduction brings its posted price for West Texas
>> Intermediate to 16.00 dlrs a barrel, the copany said.
>>     "The price reduction today was made in the light of falling
>> oil product prices and a weak crude oil market," a company
>> spokeswoman said.
>>     Diamond is the latest in a line of U.S. oil companies that
>> have cut its contract, or posted, prices over the last two days
>> citing weak oil markets.
>>  Reuter
>> > stemDocument(crude[[1]]) # language not specified
>> Diamond Shamrock Corp said that
>> effective today it had cut its contract prices for crude oil by
>> 1.50 dlrs a barrel.
>>     The reduction brings its posted price for West Texas
>> Intermediate to 16.00 dlrs a barrel, the copany said.
>>     "The price reduction today was made in the light of falling
>> oil product prices and a weak crude oil market," a company
>> spokeswoman said.
>>     Diamond is the latest in a line of U.S. oil companies that
>> have cut its contract, or posted, prices over the last two days
>> citing weak oil markets.
>>  Reuter
>> >
>> 
>> 
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/Help-with-stemDocument-tp4554523p4604022.html
>> Sent from the R help mailing list archive at Nabble.com.
>> 
>> 
>> 
>> ______________________________________________
>> R-help@ mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 


--
View this message in context: http://r.789695.n4.nabble.com/Help-with-stemDocument-tp4554523p4630523.html
Sent from the R help mailing list archive at Nabble.com.


