[Rd] Error in substring: invalid multibyte string

Sat Jun 27 11:12:42 CEST 2020

On Fri, 26 Jun 2020 15:57:06 -0700
Toby Hocking <tdhock5 using gmail.com> wrote:

>invalid multibyte string at '<e4>gel-A<6b>iyoshi'

>https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html

The server says that the text is UTF-8:

curl -sI \
 https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html | \
 grep Content-Type
# Content-Type: text/html; charset=UTF-8

But it's not, at least not all of it. If you ask readLines to mark
the text as Latin-1, you get Jens Oehlschlägel-Akiyoshi without the
mojibake and invalid multi-byte characters:

x <- readLines(
 'https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html',
 encoding = 'latin1'
)[28]
substr(x, 1, 100)
# [1] "<I>Jens Oehlschlägel-Akiyoshi"
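
For an offline illustration of what the encoding argument does, here is
a minimal sketch that writes the Latin-1 bytes of a shortened version of
that line to a temporary file (the file and the byte vector are made up
for the example; they are not the real archive page) and reads it back
with and without the mark:

```r
# Hypothetical stand-in for the archive page: "<I>Oehlschl\xe4gel\n",
# where 0xe4 is a-umlaut in Latin-1.
latin1_bytes <- as.raw(c(
  0x3c, 0x49, 0x3e,                   # <I>
  0x4f, 0x65, 0x68, 0x6c, 0x73, 0x63, # Oehlsc
  0x68, 0x6c, 0xe4, 0x67, 0x65, 0x6c, # hl(a-umlaut)gel
  0x0a                                # newline
))
tmp <- tempfile(fileext = ".html")
writeBin(latin1_bytes, tmp)

# Without the argument, the line comes back with no encoding mark.
unmarked <- readLines(tmp)
# With it, the same bytes are marked as Latin-1 (no re-encoding happens).
marked <- readLines(tmp, encoding = "latin1")
unlink(tmp)

print(Encoding(unmarked))
print(Encoding(marked))
# substr() now counts one character per byte, so this is safe.
name <- substr(marked, 4, 15)
```

Note that encoding = only marks the strings; the bytes in memory are the
same in both cases.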

When encoding = 'latin1' is not specified, the returned lines are
marked as having "unknown" encoding, and that is what leads to the
error. The substr() implementation tries to interpret such strings
according to the multi-byte rules of the current C locale (using
mbrtowc(3)). On my system (yours too, probably, if it's GNU/Linux or
macOS), the locale encoding is UTF-8, and this Latin-1 string does not
decode to valid code points as UTF-8.
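
If it helps, the mismatch can be checked without calling substr() at
all. The following is only a sketch under my own assumptions: the byte
vector is typed in by hand to stand for the surname, and I use
validUTF8() (available since R 3.3.0) and iconv() as one possible way to
detect and repair such strings, not as the only one:

```r
# Hand-typed Latin-1 bytes for "Oehlschl\xe4gel" (illustrative input).
x <- rawToChar(as.raw(c(
  0x4f, 0x65, 0x68, 0x6c, 0x73, 0x63,
  0x68, 0x6c, 0xe4, 0x67, 0x65, 0x6c
)))

print(Encoding(x))   # no mark, so R falls back to the locale's rules
# 0xe4 would start a three-byte UTF-8 sequence, but 0x67 ("g") is not a
# continuation byte, so the string is not valid UTF-8.
print(validUTF8(x))

# Convert the bytes from Latin-1 to UTF-8; the result is marked UTF-8.
z <- iconv(x, from = "latin1", to = "UTF-8")
print(validUTF8(z))
print(nchar(z, type = "chars"))  # 12 characters
print(nchar(z, type = "bytes"))  # 13 bytes: a-umlaut takes two in UTF-8
```

After the conversion, substr(z, 1, 100) should work regardless of
whether the session locale is UTF-8.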

-- 
Best regards,
Ivan