[Rd] deparse() and UTF-8 strings

Tue Feb 22 10:53:33 CET 2022

I just happened to notice a commit that adds iconv() support for the C99
\u escapes, which may or may not be a coincidence:
https://github.com/wch/r-source/commit/f19b4ae7715eea1b18ef8368b4c2849a578ade07

In any case, this is great, and it is very useful to have cross-platform
support for it. Thank you!

Would it make sense to generate braced 4-digit \uxxxx sequences, to
make sure that they don't mix with the surrounding text?
I.e. \u{xxxx}? (Plus update the 6 to 8 twice.)
https://github.com/wch/r-source/commit/f19b4ae7715eea1b18ef8368b4c2849a578ade07#diff-9a906ea3803721bf2aa8b802e98786c3b096727d87f1c423826e3bba4c112d76R746-R747
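
To illustrate what I mean by mixing with the surrounding text, here is a
small sketch that only relies on how the R parser already reads escapes
in string literals (it is not the code from the commit). Without braces,
\u takes up to four hex digits, so a short escape followed by a hex-like
letter silently changes meaning, while the braced form ends at the brace:

```r
# Braced form: the escape ends at the closing brace.
braced <- "\u{e1}b"     # U+00E1 followed by the letter "b"

# Unbraced form: "e1b" is read as one three-digit code point.
unbraced <- "\ue1b"     # U+0E1B, a single character

utf8ToInt(braced)       # two code points: 0xE1 and 0x62
utf8ToInt(unbraced)     # one code point: 0xE1B
```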

Also, it seems that we need a capital \U for the 8-digit sequences here:
https://github.com/wch/r-source/commit/f19b4ae7715eea1b18ef8368b4c2849a578ade07#diff-9a906ea3803721bf2aa8b802e98786c3b096727d87f1c423826e3bba4c112d76R753
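
A rough R-level sketch of the output I have in mind, assuming braced
escapes and a capital \U above U+FFFF. The helper escape_non_ascii() is
hypothetical and written only for illustration; the real change would
live in the C code, and this version does not handle quotes or
backslashes in the input:

```r
# Lower-case \u stops after four hex digits, so it cannot express U+1F600.
utf8ToInt("\U0001F600")   # one code point: 0x1F600
utf8ToInt("\u1F600")      # two code points: 0x1F60, then the digit "0"

# Hypothetical helper: escape every non-ASCII code point of one string.
escape_non_ascii <- function(x) {
  cp <- utf8ToInt(enc2utf8(x))
  out <- vapply(cp, function(p) {
    if (p < 128L) {
      intToUtf8(p)                 # ASCII is kept as is
    } else if (p <= 0xFFFFL) {
      sprintf("\\u{%04x}", p)      # lower-case \u, four hex digits
    } else {
      sprintf("\\U{%08x}", p)      # capital \U, eight hex digits
    }
  }, character(1))
  paste(out, collapse = "")
}

x <- "G\u00e1bor \U0001F600"
esc <- escape_non_ascii(x)
esc

# Round trip: parsing the escaped form gives back the original string.
back <- eval(parse(text = paste0('"', esc, '"')))
identical(back, x)
```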

Thank you again,
Gabor

On Mon, Feb 21, 2022 at 2:17 PM Brodie Gaslam <brodie.gaslam using yahoo.com> wrote:
>
> I'm not R-core, but happen to have run into this issue.
>
> I think this makes sense conceptually, and have had the same thought
> myself.  One implementation challenge is that the parser has a special
> branch for Unicode escape strings (e.g. "G\u00e1bor") that limits such
> input to 10K wide characters, so the parser would need to be modified in
> order to make this a general solution:
>
>  > parse(text=sprintf('"%s"', strrep("G\\u00e1bor", 2000)))
> Error in parse(text = sprintf("\"%s\"", strrep("G\\u00e1bor", 2000))) :
>    string at line 1 containing Unicode escapes not in this locale
> is too long (max 10000 chars)
>
> Such strings are rare, so maybe an interim solution is to allow this
> only when deparsing shorter strings.  The parser modification itself
> would also have the benefit of speeding up parsing of strings without
> Unicode escapes.
>
> Best,
>
> B.
>
>
> On 2/21/22 5:33 AM, Gábor Csárdi wrote:
> > I am wondering if it would make sense to produce \u escaped strings in
> > deparse() for UTF-8 input. Currently we have (in R-devel):
> >
> > x <- "G\u00e1bor"
> > Sys.setlocale("LC_ALL", "C")
> > #> [1] "C/C/C/C/C/en_US.UTF-8"
> >
> > deparse(x)
> > #> [1] "\"G<U+00E1>bor\""
> >
> > charToRaw(deparse(x))
> > #> [1] 22 47 3c 55 2b 30 30 45 31 3e 62 6f 72 22
> >
> > Is there a reason why this is preferable to returning
> >
> > "\"G\\u00e1bor\""
> >
> > i.e.
> >
> > charToRaw("\"G\\u00e1bor\"")
> > #>  [1] 22 47 5c 75 30 30 65 31 62 6f 72 22
> >
> > Returning the \u escaped form would make deparse() the inverse of
> > parse(), at least in this respect.
> >
> > Thank you,
> > Gabor
> >