[Rd] Problems with sub() due to inability to set encoding of ASCII strings

Thu Sep 29 18:38:16 CEST 2016

I'm encountering a problem using sub() on strings in R 3.3.1 on
Windows, in both RGui and RStudio. The problem occurs when the
starting string is ASCII but the replacement string is UTF-8.

If we create an ASCII string x1, its encoding is marked as "unknown",
and calling sub() on that string with a UTF-8 replacement gives garbled
output, which looks like the UTF-8 bytes of the replacement being
displayed in the native encoding:

x1 <- "a b c"
Encoding(x1)
# [1] "unknown"

replacement <- "中文"
Encoding(replacement)
# [1] "UTF-8"

(y1 <- sub("a", replacement, x1))
# [1] "ä¸æ–‡ b c"
Encoding(y1)
# [1] "unknown"
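
A quick way to check whether only the encoding mark is wrong, rather
than the bytes themselves, would be to compare raw bytes. This is just
a minimal sketch; it writes the replacement with \u escapes so that it
does not depend on how the script file is saved, and the name
`expected` is introduced here only for illustration:

```r
# Same strings as above; "\u4e2d\u6587" is the replacement written with
# escapes, so R marks it as UTF-8 regardless of the file encoding.
replacement <- "\u4e2d\u6587"
y1 <- sub("a", replacement, "a b c")

# Bytes actually stored in the result, independent of its declared encoding
charToRaw(y1)

# Bytes a correct UTF-8 result should contain
expected <- c(charToRaw(replacement), charToRaw(" b c"))
identical(charToRaw(y1), expected)
```

If the last line gives TRUE, the result holds valid UTF-8 bytes and
only the "UTF-8" mark is missing.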

If the starting string x2 has Chinese characters, it'll be marked as
UTF-8, and replacement works fine:

x2 <- "a b c 中文"
Encoding(x2)
# [1] "UTF-8"

(y2 <- sub("a", replacement, x2))
# [1] "中文 b c 中文"
Encoding(y2)
# [1] "UTF-8"

It seems like the solution should be to mark the starting string as
UTF-8, but Encoding<- apparently has no effect when the string is
ASCII, so sub() still gives the same garbled result:

# Not possible to mark x1 as UTF-8
Encoding(x1) <- "UTF-8"
Encoding(x1)
# [1] "unknown"

(y3 <- sub("a", replacement, x1))
# [1] "ä¸æ–‡ b c"
Encoding(y3)
# [1] "unknown"

It is possible to tell R that the final string y3 is UTF-8, but it
doesn't seem like this should be necessary:

Encoding(y3) <- "UTF-8"
y3
# [1] "中文 b c"
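
As a stopgap, that workaround could be wrapped in a small helper. This
is only a sketch: sub_utf8() is a hypothetical name, not something in
base R, and it assumes every input is either ASCII or UTF-8 (a string
in some other native encoding would end up mislabelled):

```r
# Hypothetical helper (not part of base R): run sub() and then declare
# the result to be UTF-8. Assumes inputs are ASCII or UTF-8 only.
sub_utf8 <- function(pattern, replacement, x, ...) {
  out <- sub(pattern, replacement, x, ...)
  Encoding(out) <- "UTF-8"  # no effect on elements that are pure ASCII
  out
}

y4 <- sub_utf8("a", "\u4e2d\u6587", "a b c")
Encoding(y4)
# [1] "UTF-8"
```

This only hides the problem at the call site; it doesn't change what
sub() itself returns.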

Is there some way to mark the starting string x1 as UTF-8 so that the
result of sub() comes out marked as UTF-8? If the inputs are both
UTF-8, it shouldn't be necessary to explicitly tell R that the output
is also UTF-8.

-Winston