[R] Help with stemDocument

Milan Bouchet-Valat nalimilan at club.fr
Sat May 12 13:10:31 CEST 2012


On Thursday 10 May 2012 at 17:12 -0700, Triss.Ashton wrote:
> Alekseiy, I tried your recommendation with several variations. It still does
> not run.  I think the problem has to do with R2.15 and the refreshed TM
> package.
It works here with R 2.15.0 and tm 0.5-7.2 (development version), with
the same versions of all other relevant packages as you (but on 64-bit
Linux). So the R and tm versions might not be the problem.

I'm using the docs example as a test:
data("crude")
crude[[1]]
stemDocument(crude[[1]])
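
To make that test less dependent on eyeballing the output, here is a minimal sketch that reports whether stemming changed the text at all. The helper name stemming_changed() is made up for illustration and is not part of tm; the tm part only runs if the package can be loaded.

```r
# Hypothetical helper (not part of tm): TRUE if stemming altered the text.
stemming_changed <- function(before, after) {
  !identical(as.character(before), as.character(after))
}

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  data("crude")
  # Capture an error instead of stopping, so the failure message is visible.
  stemmed <- tryCatch(stemDocument(crude[[1]]), error = function(e) e)
  if (inherits(stemmed, "error")) {
    message("stemDocument() failed: ", conditionMessage(stemmed))
  } else {
    message("Text changed by stemming: ",
            stemming_changed(crude[[1]], stemmed))
  }
} else {
  message("tm is not installed or could not be loaded")
}
```

If this reports FALSE, stemDocument() ran but returned the input unchanged, which is the symptom described further down in this thread.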

> Everything runs under R2.10 with the following code:
> 
> a <- Corpus(VectorSource(df$text)) # create corpus object
> a <- tm_map(a, removePunctuation)
> a <- tm_map(a, removeNumbers)
> a <- tm_map(a, removeWords, stopwords("english"))
> a <- tm_map(a, stripWhitespace)		
> a <- tm_map(a, stemDocument, language = "english") 
Let's focus on the example from the docs, since it's simple. Anyway, your
example is not reproducible since you do not provide the original data.

> 
> This same code run on R2.15 results in:
> 1. removeWords working sometimes, and sometimes not.
> 2. stemDocument not working at all.
> 
> Both error out.  removeWords always stops reading in the stopword list on
> the same line number  (I have added and subtracted words - no difference) -
> session info is below:
> 
> > a <- tm_map(a, removeWords, stopwords("english"))
> 
> Error in gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "",  : 
>   invalid regular expression
> '\b(a|about|above|across|after|again|against|all|almost|alone|along|already|also|although|always|am|among|an|and|another|any|anybody|anyone|anything|anywhere|are|area|areas|aren't|around|as|ask|asked|asking|asks|at|away|b|back|backed|backing|backs|be|became|because|become|becomes|been|before|began|behind|being|beings|below|best|better|between|big|both|but|by|c|came|can|cannot|can't|case|cases|certain|certainly|clear|clearly|come|could|couldn't|d|did|didn't|differ|different|differently|do|does|doesn't|doing|done|don't|down|downed|downing|downs|during|e|each|early|either|end|ended|ending|ends|enough|even|evenly|ever|every|everybody|everyone|everything|everywhere|f|face|faces|fact|facts|far|felt|few|find|finds|first|for|four|from|full|fully|further|furthered|furthering|furthers|g|gave|general|generally|get|gets|give|given|gives|go|going|good|goods|got|great|greater|greatest|group|grouped|grouping|groups|h|had|hadn't|has|hasn't|have|haven't|having|he|he
> 
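
One possible workaround for the removeWords error, assuming (the thread does not confirm this) that the failure is related to the size of the single alternation pattern: build the same "\\b(...)\\b" pattern shown in the error message, but from small groups of words, and apply the groups one after another. The function name remove_words_chunked(), the chunk_size default and the use of perl = TRUE are my own choices for this sketch, not what tm does internally.

```r
# Hypothetical helper (not part of tm): remove words from a character
# vector using several small regular expressions instead of one large one.
remove_words_chunked <- function(x, words, chunk_size = 50) {
  # Split the word list into groups of at most chunk_size words.
  groups <- split(words, ceiling(seq_along(words) / chunk_size))
  for (g in groups) {
    # Same pattern shape as in the error message above, one group at a time.
    pattern <- sprintf("\\b(%s)\\b", paste(g, collapse = "|"))
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x
}

out <- remove_words_chunked("the cat and the dog", c("the", "and"),
                            chunk_size = 1)
# Collapse the whitespace left behind by the removed words.
print(trimws(gsub("\\s+", " ", out)))
```

This works on a plain character vector such as df$text, so it can be tried outside the Corpus object. Like the original pattern, it does not escape regex metacharacters in the word list.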
> 
> > a <- tm_map(a, stemDocument, language = "english") 
> Error in .jnew(name) : java.lang.ClassNotFoundException
This error suggests you should reconfigure Java. Have you tried
reinstalling rJava, Snowball, RWekajars and RWeka?
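
Before reinstalling, it may help to check whether rJava can start a JVM at all. This is only a diagnostic sketch: the function name java_status() is made up here, and it uses nothing beyond rJava's .jinit() and .jcall().

```r
# Hypothetical helper: report the Java version seen by rJava, or the
# reason why the JVM could not be reached.
java_status <- function() {
  if (!requireNamespace("rJava", quietly = TRUE)) {
    return("rJava is not installed or could not be loaded")
  }
  tryCatch({
    rJava::.jinit()  # start (or attach to) the JVM
    rJava::.jcall("java/lang/System", "S", "getProperty", "java.version")
  }, error = function(e) paste("JVM problem:", conditionMessage(e)))
}

print(java_status())

# If the JVM cannot be reached, reinstalling the Java-based packages is
# the next step (uncomment to run):
# install.packages(c("rJava", "RWekajars", "RWeka", "Snowball"))
```

On Windows, a 32-bit R session like the one in your sessionInfo() needs a 32-bit Java installation, so that is worth checking too.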

> SessionInfo:
> 
> > sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: i386-pc-mingw32/i386 (32-bit)
> 
> locale:
> [1] LC_COLLATE=English_United States.1252 
> [2] LC_CTYPE=English_United States.1252   
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C                          
> [5] LC_TIME=English_United States.1252    
> 
> attached base packages:
> [1] stats4    grid      stats     graphics  grDevices utils     datasets 
> [8] methods   base     
> 
> other attached packages:
>  [1] topicmodels_0.1-5 slam_0.1-23       modeltools_0.2-19 lasso2_1.2-12    
>  [5] pvclust_1.2-2     stringr_0.6       plyr_1.7.1        Snowball_0.0-8   
>  [9] rJava_0.9-3       ggplot2_0.9.0     tm_0.5-7.1        twitteR_0.99.19  
> [13] rjson_0.2.8       RCurl_1.91-1.1    bitops_1.0-4.1   
> 
> loaded via a namespace (and not attached):
>  [1] colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2       MASS_7.3-17       
>  [5] memoise_0.1        munsell_0.3        proto_0.3-9.2      RColorBrewer_1.0-5
>  [9] reshape2_1.2.1     RWeka_0.4-11       RWekajars_3.7.5-1  scales_0.2.0      
> > 
> Hi Triss, 
> 
> If you need to stem just one text in the Corpus, use
> a[[n]] <- stemDocument(a[[n]])
> 
> Best,
> -Alex
> ________________________________________
> From: r-help-bounces@ [r-help-bounces@] on behalf of Triss.Ashton
> [triss.ashton@]
> Sent: 02 May 2012 21:09
> To: r-help@
> Subject: Re: [R] Help with stemDocument
> 
> I am having a problem with stemDocument also.  I can make it work by moving
> the data into a Corpus by using:
> 
> >  a <- Corpus(VectorSource(df$text)) # create corpus object
> >  a <- tm_map(a, stemDocument, language = "english")
> 
> but it is horribly slow.  I want to stem outside the Corpus object like:
> 
> >df$text <- stemDocument(df$text, language = "english")
> 
> but it returns the original text.
> 
> In fact, using the example in the tm package documentation does not work
> either:
> 
> > data("crude")
> > crude[[1]]
> Diamond Shamrock Corp said that
> effective today it had cut its contract prices for crude oil by
> 1.50 dlrs a barrel.
>     The reduction brings its posted price for West Texas
> Intermediate to 16.00 dlrs a barrel, the copany said.
>     "The price reduction today was made in the light of falling
> oil product prices and a weak crude oil market," a company
> spokeswoman said.
>     Diamond is the latest in a line of U.S. oil companies that
> have cut its contract, or posted, prices over the last two days
> citing weak oil markets.
>  Reuter
> > stemDocument(crude[[1]], language = "english") # specify language
> Diamond Shamrock Corp said that
> effective today it had cut its contract prices for crude oil by
> 1.50 dlrs a barrel.
>     The reduction brings its posted price for West Texas
> Intermediate to 16.00 dlrs a barrel, the copany said.
>     "The price reduction today was made in the light of falling
> oil product prices and a weak crude oil market," a company
> spokeswoman said.
>     Diamond is the latest in a line of U.S. oil companies that
> have cut its contract, or posted, prices over the last two days
> citing weak oil markets.
>  Reuter
> > stemDocument(crude[[1]]) # language not specified
> Diamond Shamrock Corp said that
> effective today it had cut its contract prices for crude oil by
> 1.50 dlrs a barrel.
>     The reduction brings its posted price for West Texas
> Intermediate to 16.00 dlrs a barrel, the copany said.
>     "The price reduction today was made in the light of falling
> oil product prices and a weak crude oil market," a company
> spokeswoman said.
>     Diamond is the latest in a line of U.S. oil companies that
> have cut its contract, or posted, prices over the last two days
> citing weak oil markets.
>  Reuter
> >