[R] Unicode characters (R 2.7.0 on Windows XP SP3 and Hardy Heron)

Sun Jun 1 21:50:59 CEST 2008

On 31.05.2008, at 00:11, Prof Brian Ripley wrote:

> On Fri, 30 May 2008, Duncan Murdoch wrote:
>> But I think with Brian Ripley's work over the last while, R for  
>> Windows actually handles utf-8 pretty well.  (It might not guess  
>> at that encoding, but if you tell it that's what you're using...)

Yes. I already mentioned that there was a big step from R 2.6 to R  
2.7 for Windows regarding the support of UTF-8.

> R passes around, prints and plots UTF-8 character data pretty well,  
> but it translates to the native encoding for almost all character- 
> level manipulations (and not just on Windows).  ?Encoding spells  
> out the exceptions (and I think the original poster had not read  
> it).  As time goes on we may add more, but it is really tedious  
> (and somewhat error-prone) to have multiple paths through the code  
> for different encodings (and different OSes do handle these  
> differently -- Windows' use of UTF-16 means that one character may  
> not be one wchar_t).

R is becoming more and more popular among philologists, linguists,
etc. It is very nice to have one software environment in which to
gather, analyze, and visualize text-based data. But linguists, for
example, very often deal with more than one language at the same
time. That's why they have to use a Unicode encoding.
In R they have to use the functions that deal with characters, such
as nchar, strsplit, grep/gsub, tolower/toupper, etc.
These functions depend, more or less, on the underlying locale
settings. But why?
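
To make the point concrete, here is a small sketch (assuming a
reasonably recent R; the variable names are only for illustration) of
how one can check what R believes about a string's encoding, and how
the character count differs from the byte count for UTF-8 data:

```r
# "naive" with a diaeresis on the i, written with a \u escape so that
# this source stays pure ASCII regardless of the editor's encoding.
x <- "na\u00efve"

# The declared encoding of the string ("UTF-8" for a \u escape).
enc <- Encoding(x)

# Number of characters versus number of bytes used to store them.
n_chars <- nchar(x, type = "chars")
n_bytes <- nchar(x, type = "bytes")

print(enc)
print(c(chars = n_chars, bytes = n_bytes))
```

Functions such as toupper or strsplit are not shown here on purpose:
what they return for non-ASCII input can depend on the locale, which
is exactly the problem described above.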

It is a very painful task to write functions for different
encodings on different platforms. Thus I wonder whether it would be
possible to switch internally to a single Unicode encoding. If one
considers memory usage, for example, UTF-8 would be an option. Of
course, such a change would be a really big challenge in terms of
effort, speed, compatibility, etc. It would also mean avoiding the
use of system libraries.
Maybe this would be a task for R 4.0, or it will remain my eternal
private dream :)

OK. Let me be a bit more realistic.
Another issue is the regular expression engine in use. On a Mac or
UNIX machine one can set a UTF-8 locale. Fine. But these locales
aren't available under Windows (yet?). Maybe it's worth having a
look at other regexp engines such as Oniguruma
( http://www.geocities.jp/kosako3/oniguruma/ ). It supports, among
others, all Unicode encodings. It is used in many applications. I do
not know how difficult it would be to integrate such a library into
R. But this would solve, I guess, 80% of the problems of R users who
are interested in text analysis. nchar, strsplit, grep etc. could
make use of it. Maybe one could write such a package for Windows
(maybe also for Mac/UNIX, because Oniguruma has some very nice
additional features). Of course, a string would have to be passed as
a UTF-8 byte stream to the Oniguruma lib, and I do not know whether
this is easily possible in R for Windows.
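
Just to illustrate what I mean by a "UTF-8 byte stream", here is a
rough sketch at the R level (assuming iconv is available in the R
build; the Latin-1 input and the variable names are made up for the
example). It does not call Oniguruma or any other external library,
it only shows the bytes such a library would have to be handed:

```r
# Build "naive" (with a diaeresis on the i) from its Latin-1 bytes,
# as it might arrive in a Western European Windows locale.
x_latin1 <- rawToChar(as.raw(c(0x6e, 0x61, 0xef, 0x76, 0x65)))
Encoding(x_latin1) <- "latin1"

# Re-encode to UTF-8; iconv marks the result as UTF-8.
x_utf8 <- iconv(x_latin1, from = "latin1", to = "UTF-8")

# The raw UTF-8 bytes: the single Latin-1 byte 0xef becomes the
# two-byte sequence 0xc3 0xaf.
utf8_bytes <- charToRaw(x_utf8)
print(utf8_bytes)
```

Whether this conversion is cheap enough to do on every call into a
regexp library is a separate question that I cannot answer.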

Once again, thanks for all the effort put into creating such a
wonderful piece of software.

--Hans