[Rd] Invalid UTF-8 with gsub(perl=TRUE) and iconv(sub="")

Mon Sep 9 21:56:48 CEST 2013

On Monday, 9 September 2013 at 13:41 -0400, Simon Urbanek wrote:
> On Sep 9, 2013, at 12:46 PM, Milan Bouchet-Valat wrote:
> 
> > On Monday, 9 September 2013 at 13:59 +0100, Prof Brian Ripley wrote:
> >> On 09/09/2013 09:49, Milan Bouchet-Valat wrote:
> >>> Hi!
> >>> 
> >>> I experience an error with an invalid UTF-8 character passed to
> >>> gsub(..., perl=TRUE); the interesting point is that with perl=FALSE (the
> >>> default) no error happens. (The character itself was read from an
> >>> invalid HTML file.) Illustration of the error:
> >>> 
> >>> gsub("a", "", "\U3e3965", perl=FALSE)
> >>> # [1] "\U3e3965"
> >>> gsub("a", "", "\U3e3965", perl=TRUE)
> >>> # Error in gsub("a", "", "\U3e3965", perl = TRUE) :
> >>> #   input string 1 is invalid UTF-8
> >>> 
> >>> 
> >>> The error message in the second command seems to come from
> >>> src/main/grep.c:1640 (in do_gsub):
> >>> if (!utf8Valid(s)) error(_("input string %d is invalid UTF-8"), i+1);
> >>> 
> >>> utf8Valid() relies on valid_utf8() from PCRE, whose behavior is
> >>> described in src/extra/pcre/pcre_valid_utf8.c.
> >>> 
> >>> 
> >>> 
> >>> Even more problematic/interesting is the fact that iconv() does not
> >>> consider the above character as invalid, as it does not replace it when
> >>> using the sub argument.
> >>> iconv("a\U3e3965", sub="")
> >>> # [1] "a\U003e3965"
> >>> 
> >>> On the contrary, an invalid sequence such as \xff is substituted:
> >>> iconv("a\xff", sub="")
> >>> # [1] "a"
> >>> 
> >>> This makes it difficult to sanitize the string before passing it to
> >>> gsub(perl=TRUE). Thus, I'm wondering whether something could be done,
> >>> and where. Should iconv() and PCRE be made to agree on the definition of
> >>> an invalid UTF-8 sequence?
> >> 
> >> iconv() is using a system service: read its help page.  So you know 
> >> where to report this ....
> > Yeah, but why is "\U003e3965" considered valid by gsub(perl=TRUE) and
> > printed as a character on Windows 7, and not on Linux? Do you think this
> > is a separate bug on Windows?
> > 
> 
> As per RFC 3629, UTF-8 does not support characters beyond U+10FFFF, so
> your U+3E3965 is not encodable in UTF-8 (it is encodable in the older
> scheme).
>
> The trick is that Windows doesn't have a UTF-8 locale and only
> supports 16-bit, so it truncates the content to U+3965:
> 
> > charToRaw("\U3e3965")
> [1] e3 a5 a5
> 
> That's why it seemingly works there.
Good catch!
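
To make the truncation concrete, here is a small sketch of the arithmetic.
It assumes that "only supports 16-bit" means keeping the low 16 bits of the
code point; that is my reading of the explanation above, not something I
have verified in the Windows code path. The variable names are mine.

```r
cp <- 0x3E3965L                      # code point from the original example
beyond_limit <- cp > 0x10FFFFL       # is it past the RFC 3629 upper bound?
low16 <- bitwAnd(cp, 0xFFFFL)        # keep only the low 16 bits
label <- sprintf("U+%04X", low16)    # human-readable form of the result
bytes <- charToRaw(intToUtf8(low16)) # UTF-8 bytes of the truncated code point
print(beyond_limit)
print(label)
print(bytes)
```

The bytes printed for the truncated code point match the charToRaw() output
quoted above, which is consistent with the truncation explanation.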

So this is just a matter of iconv() being too tolerant, while PCRE is
right. As Brian Ripley said, I know where to report the bug.
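
In the meantime, one possible workaround is to filter the bytes by hand
before calling gsub(perl=TRUE). The following is only a minimal sketch:
drop_out_of_range() is a hypothetical helper, not part of R, and it only
removes sequences whose lead byte is 0xF5 or higher (never allowed by
RFC 3629). It does not catch other kinds of invalid UTF-8 such as overlong
forms, surrogates, or 0xF4 sequences above U+10FFFF. The test string is
built byte by byte, assuming the character is stored in the old 5-byte form.

```r
# Hypothetical helper (not part of R): drop every byte sequence whose lead
# byte is 0xF5..0xFF, together with the continuation bytes (0x80..0xBF)
# that directly follow it. Works on a single string.
drop_out_of_range <- function(x) {
    b <- as.integer(charToRaw(x))
    keep <- rep(TRUE, length(b))
    i <- 1L
    while (i <= length(b)) {
        if (b[i] >= 0xF5) {
            keep[i] <- FALSE
            i <- i + 1L
            # Skip the continuation bytes belonging to this lead byte
            while (i <= length(b) && b[i] >= 0x80 && b[i] <= 0xBF) {
                keep[i] <- FALSE
                i <- i + 1L
            }
        } else {
            i <- i + 1L
        }
    }
    out <- rawToChar(as.raw(b[keep]))
    Encoding(out) <- "UTF-8"
    out
}

# "a" followed by U+3E3965 in the old 5-byte scheme (assumed byte layout)
bad <- rawToChar(as.raw(c(0x61, 0xF8, 0x8F, 0xA3, 0xA5, 0xA5)))
clean <- drop_out_of_range(bad)
res <- gsub("a", "", paste0(clean, "b"), perl = TRUE)
print(clean)
print(res)
```

The same helper also drops a stray \xff, so it behaves like
iconv(sub="") for that case while additionally removing the out-of-range
sequence that iconv() lets through.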

Regards