[Rd] SUGGESTION: Force install.packages() to use ASCII encoding when parse():ing code?

Thu Dec 11 21:28:54 CET 2014

On Thu, Dec 11, 2014 at 10:47 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
> On 11/12/2014 12:59 PM, Henrik Bengtsson wrote:
>>
>> SUGGESTION:
>> Would it make sense if install.packages() and friends always use an
>> "ascii"(*) encoding when parse():ing R package source code files?
>
>
> I think that would be a step backwards.  It would be better to accept other
> encodings.  As an English speaker this isn't a big deal to me, but users of
> other languages may want to have messages and variable names in their native
> language, and ASCII might not be enough for that.

Thanks for the feedback.  While I'd probably agree with you that R
packages should support source code encodings other than ASCII, that
would require a change in the specifications and design.  What I'm
proposing is (just) an adjustment to the implementation to meet the
current specs and design.

>
> On the other hand, I think it's quite reasonable to require a declared
> encoding if anything other than ASCII is used, and possibly to fail for some
> encodings.  It is probably also reasonable to at least warn when non-ASCII
> characters are used in strings in packages on CRAN, as many users can't
> display all characters.

That would be a reasonable extension of the design, which would be
backward compatible with the current design, i.e. if encoding for the
source code is not declared, then it is assumed to be ASCII.
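
As a rough sketch of that fallback, assuming the declaration would be the
'Encoding' field of the package's DESCRIPTION file (declared_encoding()
is a hypothetical helper for illustration, not an existing R function):

```r
# Hypothetical helper: return the encoding declared in DESCRIPTION,
# falling back to "ASCII" when no Encoding field is present.
declared_encoding <- function(pkgdir) {
  desc <- read.dcf(file.path(pkgdir, "DESCRIPTION"), fields = "Encoding")
  enc <- desc[[1, "Encoding"]]   # NA when the field is missing
  if (is.na(enc)) "ASCII" else enc
}

# Throwaway package skeleton in a temporary directory
pkgdir <- file.path(tempdir(), "foo")
dir.create(pkgdir, showWarnings = FALSE)
descfile <- file.path(pkgdir, "DESCRIPTION")

# No declaration: assume ASCII
writeLines(c("Package: foo", "Version: 0.1"), descfile)
enc_default <- declared_encoding(pkgdir)

# Explicit declaration: use it
writeLines(c("Package: foo", "Version: 0.1", "Encoding: UTF-8"), descfile)
enc_declared <- declared_encoding(pkgdir)
```

The installer could then pass that value on when reading the R files,
instead of picking up whatever getOption("encoding") happens to be.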

Source code comments are special, because the current design
('Writing R Extensions') leaves it open to use any encoding in them.
Read freely, it even allows different encodings for different
comments in the same file (which is not unlikely to occur, considering
cut'n'paste and open-source licenses).  If other encodings are to be
supported, then I see two ways forward:

1. Have R completely ignore what's in the comments (what follows #
until the newline) such that encoding does not matter, or
2. Require the same encoding for the source code comments as the rest
of the code.

As I see it, today's design falls (could fall?) under 1, but the
implementation does not go all the way to support it.
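
For what it's worth, at the expression level parse() already behaves as
in 1.  A small sketch (the code lines are made up for illustration)
showing that comments leave no trace in the parsed result, which suggests
it is the reading of the file, not the parser proper, that trips on them:

```r
# Three lines of code, two of which carry comments
code <- c("# a comment", "x <- 1  # trailing comment", "y <- x + 1")

# Parse from text, so that no file encoding is involved at all
exprs <- parse(text = code, keep.source = FALSE)

# Only the two assignments survive; the comments are gone
n_exprs <- length(exprs)
deparsed <- vapply(exprs, function(e) paste(deparse(e), collapse = ""),
                   character(1))
```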

/Henrik

PS. It should be emphasized that this is about R packages. AFAIK, you
can already source() code written in any encoding, e.g.
> raw <- as.raw(c(
+  0xcf, 0x80, 0x20, 0x3c, 0x2d, 0x20, 0x70, 0x69, 0x0a,
+  0x70, 0x72, 0x69, 0x6e, 0x74, 0x28, 0xcf, 0x80, 0x29, 0x0a
+ ))
> writeBin(raw, con="pi.R")
> source("pi.R", encoding="UTF-8")
[1] 3.141593
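
A minimal sketch for checking at the byte level, i.e. before any
encoding is assumed, whether a file is pure ASCII.  Here
non_ascii_offsets() is a hypothetical helper, and the bytes are the same
ones as in the pi.R example:

```r
# Hypothetical helper: 1-based offsets of all bytes above 0x7F
non_ascii_offsets <- function(bytes) which(as.integer(bytes) > 127L)

pi_bytes <- as.raw(c(
  0xcf, 0x80, 0x20, 0x3c, 0x2d, 0x20, 0x70, 0x69, 0x0a,
  0x70, 0x72, 0x69, 0x6e, 0x74, 0x28, 0xcf, 0x80, 0x29, 0x0a
))

# Round-trip through a temporary file and read it back as raw bytes
path <- tempfile(fileext = ".R")
writeBin(pi_bytes, path)
bytes <- readBin(path, what = "raw", n = file.info(path)$size)

# Offsets of the two 2-byte UTF-8 sequences (0xcf 0x80)
offsets <- non_ascii_offsets(bytes)

# A pure-ASCII snippet has no such bytes
ascii_only <- length(non_ascii_offsets(charToRaw("x <- 1\n"))) == 0L
```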

>
> Duncan Murdoch
>>
>>
>> I believe this should be safe, because R code files should be in ASCII
>> [http://en.wikipedia.org/wiki/ASCII] and other characters may be used
>> only in source-code comments.  This is from Section 'Package
>> subdirectories' in 'Writing R Extensions':
>>
>> "Only ASCII characters (and the control characters tab, formfeed, LF
>> and CR) should be used in code files. Other characters are accepted in
>> comments, but then the comments may not be readable in e.g. a UTF-8
>> locale. Non-ASCII characters in object names will normally fail when
>> the package is installed. Any byte will be allowed in a quoted
>> character string but \uxxxx escapes should be used for non-ASCII
>> characters. However, non-ASCII character strings may not be usable in
>> some locales and may display incorrectly in others."
>>
>> Since comments are dropped by parse(), their actual content does not
>> matter, and the rest of the code should be in ASCII.
>>
>> (*) It could be that the specific encoding "ascii" is not available on
>> all platforms. If so, is there another way to specify a pure ASCII
>> encoding?
>>
>>
>>
>> BACKGROUND:
>> If a user/system sets the 'encoding' option at startup, it may break
>> package installations from source if the package has source code
>> comments with non-ASCII characters.  For example,
>>
>> $ mkdir foo; cd foo
>> $ echo "options(encoding='UTF-8')" > .Rprofile
>> $ R --vanilla
>> > install.packages("R.oo", type="source")
>>
>> > install.packages("R.oo", type="source")
>> Installing package into 'C:/Users/hb/R/win-library/3.2'
>> (as 'lib' is unspecified)
>> --- Please select a CRAN mirror for use in this session ---
>> trying URL 'http://cran.at.r-project.org/src/contrib/R.oo_1.18.0.tar.gz'
>> Content type 'application/x-gzip' length 394545 bytes (385 KB)
>> opened URL
>> downloaded 385 KB
>>
>> * installing *source* package 'R.oo' ...
>> ** package 'R.oo' successfully unpacked and MD5 sums checked
>> ** R
>> Warning in parse(outFile) :
>>    invalid input found on input connection
>> 'C:/Users/hb/R/win-library/3.2/R.oo/R/R.oo'
>> ** inst
>> ** preparing package for lazy loading
>> Warning in parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) :
>>    invalid input found on input connection
>> 'C:/Users/hb/R/win-library/3.2/R.oo/R/R.oo'
>> ** help
>> [...]
>>
>> (This can be extremely time-consuming to troubleshoot, particularly
>> when reported to a package maintainer who does not have access to the
>> original system.)
>>
>> FYI, setting it only in the session is alright:
>>
>> > options(encoding="UTF-8")
>> > install.packages("R.oo", type="source")
>>
>> because install.packages() launches a separate R process for the
>> installation, and only then does the startup code become an issue.
>>
>>
>> TROUBLESHOOTING:
>> My understanding of the
>>
>> Warning in parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) :
>>    invalid input found on input connection
>> 'C:/Users/hb/R/win-library/3.2/R.oo/R/
>>
>> is that this happens when there is a non-ASCII character in one of the
>> source-code comments (*) whose bit pattern matches the lead byte of a
>> multi-byte UTF-8 sequence [http://en.wikipedia.org/wiki/UTF-8#Description].
>> For instance, consider a source code comment with an acute accent:
>>
>> > raw <- as.raw(c(0x23, 0x20, 0xe9, 0x74, 0x75, 0x64, 0x69, 0x61, 0x6e,
>> +   0x74, 0x0a))
>> > writeBin(raw, con="foo.R")
>> > code <- readLines("foo.R")
>> > code
>> [1] "# étudiant"
>>
>> > options(encoding="UTF-8")
>> > parse("foo.R")
>> Warning message:
>> In readLines(file, warn = FALSE) :
>>    invalid input found on input connection 'foo.R'
>>
>> > options(encoding="ascii")
>> > parse("foo.R")
>> expression()
>>
>> Reason for the "invalid input": The bit pattern for raw[3:5] is:
>>
>> > R.utils::intToBin(raw[3:5])
>> [1] "11101001" "01110100" "01110101"
>>
>> The first byte (raw[3]) matches the special UTF-8 byte pattern "1110xxxx",
>> which according to UTF-8 should be followed by two more bytes, each with
>> bit pattern "10xxxxxx"
>> [http://en.wikipedia.org/wiki/UTF-8#Description].  Since raw[4:5] do
>> not match those, it's an invalid UTF-8 byte sequence.  So, technically
>> this does not happen for all comments using acute accents, but it's
>> very likely.  More generally, a multi-byte UTF-8 sequence is expected
>> when byte pattern "11xxxxxx" (>= 192 in decimal values) is encountered.
>> Looking at http://en.wikipedia.org/wiki/ISO/IEC_8859, there are several
>> characters with this bit pattern in many "Latin-N" encodings, which
>> I'd assume are still in dominant use by many developers.
>>
>> So, since options(encoding="UTF-8") was set at startup, that is also
>> the encoding that R tries to follow.  My suggestion is that R should
>> always use a pure-ASCII encoding when parsing R code in packages,
>> because that is what 'Writing R Extensions' says we should use in the
>> first place.
>>
>> /Henrik