[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text

Milan Bouchet-Valat nalimilan at club.fr
Tue Apr 9 21:55:42 CEST 2013


On Tuesday 09 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
> Hi,
> 
> I bumped into a serious issue while trying to analyse some texts in
> Bulgarian (with the tm package). I import a tab-separated csv file,
> which holds a total of 22 variables, most of which are text columns
> (not factors), using the read.delim function:
> 
> data<-read.delim("bigcompanies_ascii.csv",
>                  header=TRUE,
>                  quote="'",
>                  sep="\t",
>                  as.is=TRUE,
>                  encoding='CP1251',
>                  fileEncoding='CP1251')
> 
> (I also tried the above with UTF-8 encoding on a UTF-8-saved file.)
> 
> I have my list of stop words written in a separate text file, one word 
> per line, which I read into R using the scan function:
> 
> stoplist<-scan(file='stoplist_ascii.txt',
>                 what='character',
>                 strip.white=TRUE,
>                 blank.lines.skip=TRUE,
>                 fileEncoding='CP1251',
>                 encoding='CP1251')
> 
> (also tried with UTF-8 here on a correspondingly encoded file)
> 
> I currently only test with a corpus based on the contents of just one 
> variable, and I construct the corpus from a VectorSource. When I run 
> inspect, all seems fine and I can see the text properly, with unicode 
> characters present:
> 
> data.corpus<-Corpus(VectorSource(data$variable,encoding='UTF-8'),
>                     readerControl=list(language='bulgarian'))
> 
> However, no matter which encoding I select (UTF-8 or CP1251, the
> typical code page for Bulgarian texts), I cannot remove the stop words
> from my corpus. The issue occurs on both Linux and Windows, and on
> every computer I use R on, so I don't think it is caused by a bad
> configuration. Removal of punctuation, white space, and numbers works
> flawlessly, but the inability to remove stop words prevents me from
> analysing the texts any further.
> 
> Has anybody had experience with languages other than English for
> which no predefined stop list is available through the stopwords
> function? I would highly appreciate any tips and advice!
Well, at least show us the code that you use to remove stopwords... Can
you provide a reproducible example with a toy corpus?
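
For illustration, a toy example could look something like the sketch
below. The three sentences and the stop list are made up, and the sketch
assumes a recent version of tm in which VCorpus, removeWords and
stripWhitespace are available, as well as a script saved as UTF-8. Your
real texts and stop list would take their place, so that others can run
exactly what you run:

```r
library(tm)

# Toy data defined inline, so no file or fileEncoding arguments are involved.
# These sentences and stop words are invented for illustration only.
texts <- c("това е малък тест",
           "и това е още един тест",
           "няма стоп думи тук")
stoplist <- c("това", "е", "и")

# Report how R has marked the strings ("UTF-8", "latin1" or "unknown").
print(Encoding(texts))
print(Encoding(stoplist))

# Build the corpus and apply the stop word removal step under discussion.
corpus  <- VCorpus(VectorSource(texts))
cleaned <- tm_map(corpus, removeWords, stoplist)
cleaned <- tm_map(cleaned, stripWhitespace)

# Show the resulting text of each document.
result <- sapply(cleaned, as.character)
print(result)

# Session details help others compare locale and package versions.
print(sessionInfo())
```

Posting the printed output together with the code shows at which step
the stop words survive, and under which locale.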

> Thanks in advance,
> Vince
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


