[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text

Ventseslav Kozarev, MPP vinceeval at gmail.com
Tue Apr 9 09:10:26 CEST 2013


Hi,

I have run into a serious issue while trying to analyse some texts in
Bulgarian with the tm package. I import a tab-separated CSV file that
holds a total of 22 variables, most of which are text cells (not
factors), using the read.delim function:

data<-read.delim("bigcompanies_ascii.csv",
                 header=TRUE,
                 quote="'",
                 sep="\t",
                 as.is=TRUE,
                 encoding='CP1251',
                 fileEncoding='CP1251')

(I also tried the above with UTF-8 encoding on a UTF-8-saved file.)

I have my list of stop words written in a separate text file, one word 
per line, which I read into R using the scan function:

stoplist<-scan(file='stoplist_ascii.txt',
                what='character',
                strip.white=TRUE,
                blank.lines.skip=TRUE,
                fileEncoding='CP1251',
                encoding='CP1251')

(I also tried this with UTF-8 on a correspondingly encoded file.)
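
To rule out an encoding mismatch between the stop list and the text, a
check along the following lines might help. This is only a minimal
sketch in base R: stop_demo and text_demo are hypothetical stand-ins
for the imported stop list and one text cell, not my real data.

```r
# Hypothetical stand-ins for imported values, written with \u escapes so the
# snippet does not depend on the encoding of the script file itself.
stop_demo <- c("\u0438", "\u043d\u0430")   # Bulgarian "i" and "na"
text_demo <- "\u0446\u0435\u043d\u0430 \u043d\u0430 \u0430\u043a\u0446\u0438\u0438"  # "tsena na aktsii"

# Encoding() reports the declared encoding mark of each string
enc_stop <- Encoding(stop_demo)
enc_text <- Encoding(text_demo)

# enc2utf8() converts strings to UTF-8 so both sides use the same encoding
stop_utf8 <- enc2utf8(stop_demo)
text_utf8 <- enc2utf8(text_demo)

# validUTF8() flags strings whose bytes are not well-formed UTF-8
all_valid <- all(validUTF8(c(stop_utf8, text_utf8)))
```

In a real session, stop_demo and text_demo would be replaced by
stoplist and data$variable; if the two report different encodings, or
if all_valid is FALSE, the import step is the first thing to revisit.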

I am currently testing with a corpus built from the contents of just
one variable, constructed from a VectorSource. When I run inspect, all
seems fine and I can see the text properly, with Unicode characters
present:

data.corpus<-Corpus(VectorSource(data$variable,encoding='UTF-8'),
                    readerControl=list(language='bulgarian'))

However, no matter which encoding I select (UTF-8 or CP1251, the
typical code page for Bulgarian texts), I cannot remove the stop words
from my corpus. The issue occurs on both Linux and Windows, on every
computer I use R on, so I don't think it is caused by a bad
configuration. Removal of punctuation, white space, and numbers works
flawlessly, but the inability to remove stop words prevents me from
analysing the texts further.
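
To separate an encoding problem from a matching problem, the stop words
can also be removed without tm at all. The following is a minimal
sketch in base R, not the tm implementation: remove_stopwords is a
hypothetical helper, the sample strings are made up, and it assumes
that tokens are separated by single spaces.

```r
# Hypothetical sample data, written with \u escapes so the snippet is
# independent of the encoding of the script file.
texts <- c("\u0446\u0435\u043d\u0430 \u043d\u0430 \u0430\u043a\u0446\u0438\u0438",  # "tsena na aktsii"
           "\u043f\u0430\u0437\u0430\u0440 \u0438 \u0431\u0430\u043d\u043a\u0438")   # "pazar i banki"
stops <- c("\u0438", "\u043d\u0430")                                                 # "i", "na"

# Hypothetical helper: split each string on single spaces, drop every token
# that appears in the stop list, and glue the remaining tokens back together.
remove_stopwords <- function(x, words) {
  tokens <- strsplit(x, " ", fixed = TRUE)
  vapply(tokens,
         function(tok) paste(tok[!(tok %in% words)], collapse = " "),
         character(1))
}

cleaned <- remove_stopwords(texts, stops)
```

If this removes the stop words from the imported data as well, the
strings themselves match and the problem is more likely in the removal
step applied to the corpus than in the import.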

Has anybody had experience with languages other than English for which
no predefined stop list is available through the stopwords function? I
would highly appreciate any tips and advice!

Thanks in advance,
Vince


