[R] Problem comparing two strings

Duncan Murdoch
Mon Nov 18 16:45:55 CET 2019


On 18/11/2019 10:11 a.m., Björn Fisseler wrote:
> Hello,
> 
> I'm struggling to compare two strings that come from different data
> sets. The strings look identical: "Alexander Jäger"
> 
> But when I compare these strings: string1 == string2
> the result is FALSE.
> 
> Looking at the raw bytes used to encode the strings, the results are
> different:
> 
> string1: 41 6c 65 78 61 6e 64 65 72 20 4a c3 a4 67 65 72
> string2: 41 6c 65 78 61 6e 64 65 72 20 4a 61 cc 88 67 65 72
> 
> string2 comes from the file names of different files on my machine
> (macOS), string1 comes from a data file (csv, UTF8 encoding).
> 
> The umlaut "ä" is obviously the culprit in this example: it is encoded
> with two bytes in string1 and three bytes in string2. How can I change
> this? The problem makes it impossible to join the two data sets on the
> names. I already checked the settings on my machine: Sys.getlocale()
> returns "de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8".
> Changing/forcing the encoding of the data didn't bring the results I
> expected.
> 
> What else can I try?

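Characters like ä have two (or more) representations in Unicode:  a 
single precomposed code point (U+00E4, the bytes c3 a4 in your string1), 
or the code point for "a" (U+0061) followed by a combining code point 
(U+0308, the bytes cc 88 in your string2) that says "add an umlaut".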
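
As a small illustration of the two representations, the sketch below 
builds both forms with Unicode escapes. The variable names are made up 
for this example, and it assumes a UTF-8 session like the one reported 
by Sys.getlocale() in the question.

```r
# Precomposed form: a single code point, U+00E4
composed <- "J\u00e4ger"
# Decomposed form: "a" (U+0061) followed by the combining diaeresis (U+0308)
decomposed <- "Ja\u0308ger"

# Both usually print the same, but they are different strings
print(composed == decomposed)   # FALSE

# The decomposed form has one more code point
print(nchar(composed))          # 5
print(nchar(decomposed))        # 6

# Inspect the underlying bytes, as done in the question
print(charToRaw(composed))
print(charToRaw(decomposed))
```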

If you want to compare strings, you need a consistent representation. 
This is called normalizing the string.

There are several possible normalizations (NFC, NFD, NFKC and NFKD); 
for your purposes it doesn't matter which one you use, as long as you 
use the same normalization for both strings.  See 
<https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html> 
for details.
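
To illustrate that the choice of form matters less than consistency, 
here is a minimal sketch. It assumes the stringi package is installed, 
and the two example strings are invented for the demonstration.

```r
library(stringi)

composed   <- "J\u00e4ger"    # precomposed ä (U+00E4)
decomposed <- "Ja\u0308ger"   # "a" + combining diaeresis (U+0308)

# Same normalization on both sides: the strings compare equal
same_nfc <- stri_trans_nfc(composed) == stri_trans_nfc(decomposed)
same_nfd <- stri_trans_nfd(composed) == stri_trans_nfd(decomposed)

# Different normalizations on each side: still unequal
mixed <- stri_trans_nfc(composed) == stri_trans_nfd(decomposed)

print(c(same_nfc, same_nfd, mixed))   # TRUE TRUE FALSE
```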

In R, there are several functions that do the normalization for you. 
Two are utf8::utf8_normalize and stringi::stri_trans_nfc.  So you'd want 
something like

   library(utf8)
   string1 <- utf8_normalize(string1)
   string2 <- utf8_normalize(string2)
   string1 == string2  # Should now work as expected
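
Since the goal in the question is to join two data sets on the names, 
the same idea can be applied to the key columns before merging. This is 
only a sketch: the data frames and column names below are hypothetical 
stand-ins for the real data, and it assumes stringi is installed.

```r
library(stringi)

# Hypothetical stand-ins for the CSV data and the file-name data
from_csv <- data.frame(name = "Alexander J\u00e4ger", score = 1,
                       stringsAsFactors = FALSE)
from_files <- data.frame(name = "Alexander Ja\u0308ger", file = "a.pdf",
                         stringsAsFactors = FALSE)

# Without normalization the keys differ, so the join finds nothing
rows_before <- nrow(merge(from_csv, from_files, by = "name"))

# Normalize both key columns to the same form (NFC here)
from_csv$name   <- stri_trans_nfc(from_csv$name)
from_files$name <- stri_trans_nfc(from_files$name)

# The names now match, so the join finds the row
joined <- merge(from_csv, from_files, by = "name")
rows_after <- nrow(joined)

print(c(rows_before, rows_after))
```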

Duncan Murdoch


