[R] Unicode characters (R 2.7.0 on Windows XP SP3 and Hardy Heron)

Fri May 30 17:14:54 CEST 2008

Hi all

Four questions regarding Unicode.

Three Windows questions. I am using

- a PC with Windows XP (Build 20600.xpsp080413-2111, Service Pack 3);
- the following R version:

> R.version
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          7.0
year           2008
month          04
day            22
svn rev        45424
language       R
version.string R version 2.7.0 (2008-04-22)

- the following locale:
> Sys.getlocale(category = "LC_ALL")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

# I loaded the file
# <http://www.linguistics.ucsb.edu/faculty/stgries/teaching/russ_corp.txt>
# into R, and this works fine.
corpus.file<-scan(choose.files(), what="char", sep="\n", quote="",
comment.char="", encoding="UTF-8")

# My problems are the following:
# 1 strsplit

# This does not work:
words.1<-unlist(strsplit(corpus.file, "[-!;:\'\"\\?\\. ]+", perl=TRUE))

# - words.1[173] should be "фирме", as in corpus.file[6]
# but it is "Ñ„Ð¸Ñ€Ð¼Ðµ"
# - words.1[208] should be "Торговли", as in corpus.file[13]
# but it is "Ð¢Ð¾Ñ€Ð³Ð¾Ð²Ð»Ð¸"
# - words.1[214] should be "клиентов", as in corpus.file[14]
# but it is "ÐºÐ»Ð¸ÐµÐ½Ñ‚Ð¾Ð²"
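
One way to narrow this down is to check whether the UTF-8 encoding mark survives the split. The following is only a minimal sketch: sample.line is a hypothetical stand-in for one element of corpus.file, and re-declaring the encoding with Encoding<- is an untested guess at a workaround, not a confirmed fix for R 2.7.0 on Windows.

```r
# Hypothetical stand-in for one element of corpus.file ("фирме Торговли.")
sample.line <- "\u0444\u0438\u0440\u043c\u0435 \u0422\u043e\u0440\u0433\u043e\u0432\u043b\u0438."
Encoding(sample.line)   # encoding mark before splitting

sample.words <- unlist(strsplit(sample.line, "[-!;:'\"?. ]+", perl = TRUE))
Encoding(sample.words)  # encoding mark after splitting

# If the mark is gone but the bytes are still UTF-8, re-declaring the
# encoding changes only the label, not the bytes themselves
Encoding(sample.words) <- "UTF-8"
sample.words
```

If Encoding(sample.words) reports "unknown" where Encoding(sample.line) reported "UTF-8", the garbled display would be the UTF-8 bytes shown as Windows-1252 text.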

# 2 entering Unicode characters into R: I want to search for,
# say, "для". So I try to define it as follows,
# but this doesn't work:
(x123<-"\u0434\u043b\u044F")

# I can define each individual character
(x1<-"\u0434"); (x2<-"\u043b"); (x3<-"\u044F")

# and each pair of characters
(x12<-"\u0434\u043b")
(x13<-"\u0434\u044F")
(x23<-"\u043b\u044F")

# but not all three ... the last one gets skipped.
# why's that and how do I do it?
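
A possible workaround to test here is to build the string from code points rather than from \u escapes. This is a sketch only: the names x123.alt and x123.paste are made up for illustration, and I have not confirmed that either variant avoids the problem on Windows.

```r
# Build "для" from its Unicode code points instead of \u escapes
x123.alt <- intToUtf8(c(0x0434, 0x043B, 0x044F))

# Or paste together the single characters that can be defined individually
x123.paste <- paste("\u0434", "\u043b", "\u044F", sep = "")

nchar(x123.alt, type = "chars")  # number of characters in the result
utf8ToInt(x123.alt)              # code points, to see whether any was dropped
```

Comparing utf8ToInt() on each variant shows directly which code point, if any, went missing.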

# 3 defining Unicode character ranges: in each of the following,
# the closing bracket does not get included (even if it is itself
# written as a Unicode escape):
russ.char.yes<-"[\u0401\u0410-\u044F\u0451]" # all Russian Cyrillics
russ.char.no<-"[^\u0401\u0410-\u044F\u0451]" # other characters
russ.char.capit<-"[\u0401\u0410-\u042F]" # capital Russian Cyrillics
russ.char.small<-"[\u0430-\u044F\u0451]" # small Russian Cyrillics
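
To see whether a class like this was parsed completely, it can be matched against a few probe characters. This is a minimal sketch: probe.chars and hits are hypothetical names, and the check assumes the session handles UTF-8 strings correctly, which is the very thing in doubt on Windows.

```r
russ.char.yes <- "[\u0401\u0410-\u044F\u0451]"

# Probe characters: the two explicit ones (Ё, ё), both ends of the
# range (А, я), and one Latin letter that should not match
probe.chars <- c("\u0401", "\u0410", "\u044F", "\u0451", "a")

# regexpr() returns -1 where there is no match
hits <- regexpr(russ.char.yes, probe.chars) > 0
hits
nchar(russ.char.yes, type = "chars")  # 8 if no character was dropped
```

A count lower than 8 from nchar() would mean the pattern string itself lost a character before any matching took place.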

# I can do that all on Linux, but this arises in a context where
# many other character processing issues are explained for Mac,
# Linux, *and* Windows, and I'd hate to have to say "this one
# thing, you can't do on Windows"

One Linux question. I am using Ubuntu Hardy Heron:

> sessionInfo()
R version 2.7.0 (2008-04-22)
i486-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

# strange(?) behavior of word boundary characters:
# I understand why these work ...
grep("\\bмолод", "а молодость", perl=FALSE, value=TRUE) # OK
# [1] "а молодость"
gsub("\\bмолод", ">XX<", "а молодость", perl=FALSE) # OK
# [1] "а >XX<ость"

# but why does "\\b" not work with perl=TRUE?
grep("\\bмолод", "а молодость", perl=TRUE, value=TRUE) # FAIL
# character(0)
gsub("\\bмолод", ">XX<", "а молодость", perl=TRUE) # FAIL
# [1] "а молодость"
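
My guess is that with perl=TRUE, "\\b" is defined in terms of ASCII-only "\\w", so there is no boundary between a space and a Cyrillic letter. Two variants that might sidestep this are sketched below. Both assume a PCRE build with Unicode property support, and the (*UCP) prefix in particular may need a newer PCRE than the one shipped with R 2.7.0. The names target, stem, ucp.result and look.result are made up for illustration.

```r
target <- "\u0430 \u043c\u043e\u043b\u043e\u0434\u043e\u0441\u0442\u044c"  # "а молодость"
stem <- "\u043c\u043e\u043b\u043e\u0434"                                   # "молод"

# Variant 1: ask PCRE to make \w (and therefore \b) Unicode-aware
ucp.result <- gsub(paste("(*UCP)\\b", stem, sep = ""), ">XX<",
                   target, perl = TRUE)

# Variant 2: spell out the boundary as "not preceded by a letter"
look.result <- gsub(paste("(?<!\\p{L})", stem, sep = ""), ">XX<",
                    target, perl = TRUE)

ucp.result
look.result
```

The second variant only checks the left-hand side of the stem, which is all this particular search needs.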

Any pointers would be much appreciated and acknowledged ...
STG