[Rd] SUGGESTION: Force install.packages() to use ASCII encoding when parse():ing code?

Thu Dec 11 19:47:10 CET 2014

On 11/12/2014 12:59 PM, Henrik Bengtsson wrote:
> SUGGESTION:
> Would it make sense if install.packages() and friends always use an
> "ascii"(*) encoding when parse():ing R package source code files?

I think that would be a step backwards.  It would be better to accept 
other encodings.  As an English speaker this isn't a big deal to me, but 
users of other languages may want to have messages and variable names in 
their native language, and ASCII might not be enough for that.

On the other hand, I think it's quite reasonable to require a declared 
encoding if anything other than ASCII is used, and possibly to fail for 
some encodings.  It is probably also reasonable to at least warn when 
non-ASCII characters are used in strings in packages on CRAN, as many 
users can't display all characters.
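
As a rough illustration of such a check, here is a minimal sketch that reports which lines of a file contain bytes outside 7-bit ASCII.  The helper name non_ascii_lines() is hypothetical, not an existing R function, and the sketch flags comments as well as strings, so it is coarser than a real check would need to be.  The tools package has showNonASCIIfile() for a similar purpose.

```r
# Hypothetical helper (not part of R): return the line numbers in `path`
# that contain at least one byte outside the 7-bit ASCII range.
non_ascii_lines <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  # The line of a byte is 1 + the number of LF bytes that precede it.
  is_lf <- bytes == as.raw(0x0a)
  line_of_byte <- cumsum(c(0L, head(is_lf, -1L))) + 1L
  unique(line_of_byte[as.integer(bytes) > 127L])
}

# Example file: line 1 is pure ASCII, line 2 is "# étudiant" in Latin-1.
tmp <- tempfile(fileext = ".R")
writeBin(as.raw(c(0x78, 0x20, 0x3c, 0x2d, 0x20, 0x31, 0x0a,
                  0x23, 0x20, 0xe9, 0x74, 0x75, 0x64, 0x69,
                  0x61, 0x6e, 0x74, 0x0a)), tmp)
flagged <- non_ascii_lines(tmp)
print(flagged)
```

Working on raw bytes avoids any dependence on the session's 'encoding' option or locale, since nothing is decoded.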

Duncan Murdoch
>
> I believe this should be safe, because R code files should be in ASCII
> [http://en.wikipedia.org/wiki/ASCII], and other characters may be used
> only in source-code comments.  This is from Section 'Package
> subdirectories' in 'Writing R Extensions':
>
> "Only ASCII characters (and the control characters tab, formfeed, LF
> and CR) should be used in code files. Other characters are accepted in
> comments, but then the comments may not be readable in e.g. a UTF-8
> locale. Non-ASCII characters in object names will normally fail when
> the package is installed. Any byte will be allowed in a quoted
> character string but \uxxxx escapes should be used for non-ASCII
> characters. However, non-ASCII character strings may not be usable in
> some locales and may display incorrectly in others."
>
> Since comments are dropped by parse(), their actual content does not
> matter, and the rest of the code should be in ASCII.
>
> (*) It could be that the specific encoding "ascii" is not available
> on all platforms. If so, is there another way to specify a pure ASCII
> encoding?
>
>
>
> BACKGROUND:
> If a user/system sets the 'encoding' option at startup, it may break
> package installations from source if the package has source code
> comments with non-ASCII characters.  For example,
>
> $ mkdir foo; cd foo
> $ echo "options(encoding='UTF-8')" > .Rprofile
> $ R --vanilla
> > install.packages("R.oo", type="source")
>
> > install.packages("R.oo", type="source")
> Installing package into 'C:/Users/hb/R/win-library/3.2'
> (as 'lib' is unspecified)
> --- Please select a CRAN mirror for use in this session ---
> trying URL 'http://cran.at.r-project.org/src/contrib/R.oo_1.18.0.tar.gz'
> Content type 'application/x-gzip' length 394545 bytes (385 KB)
> opened URL
> downloaded 385 KB
>
> * installing *source* package 'R.oo' ...
> ** package 'R.oo' successfully unpacked and MD5 sums checked
> ** R
> Warning in parse(outFile) :
>    invalid input found on input connection 'C:/Users/hb/R/win-library/3.2/R.oo/R/R.oo'
> ** inst
> ** preparing package for lazy loading
> Warning in parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) :
>    invalid input found on input connection 'C:/Users/hb/R/win-library/3.2/R.oo/R/R.oo'
> ** help
> [...]
>
> (This can be an extremely time-consuming task to troubleshoot,
> particularly if reported to a package maintainer who does not have
> access to the original system).
>
> FYI, setting it only in the session is alright:
>
> > options(encoding="UTF-8")
> > install.packages("R.oo", type="source")
>
> because install.packages() launches a separate R process for the
> installation and it's only then that the startup code becomes an issue.
>
>
> TROUBLESHOOTING:
> My understanding of the
>
> Warning in parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) :
>    invalid input found on input connection 'C:/Users/hb/R/win-library/3.2/R.oo/R/
>
> is that this happens when there is a non-ASCII character in one of the
> source-code comments (*) whose bit pattern looks like the start of a
> multi-byte UTF-8 sequence
> [http://en.wikipedia.org/wiki/UTF-8#Description] but is not followed
> by valid continuation bytes.  For instance, consider a source code
> comment with an acute accent:
>
> > raw <- as.raw(c(0x23, 0x20, 0xe9, 0x74, 0x75, 0x64, 0x69, 0x61, 0x6e, 0x74, 0x0a))
> > writeBin(raw, con="foo.R")
> > code <- readLines("foo.R")
> > code
> [1] "# étudiant"
>
> > options(encoding="UTF-8")
> > parse("foo.R")
> Warning message:
> In readLines(file, warn = FALSE) :
>    invalid input found on input connection 'foo.R'
>
> > options(encoding="ascii")
> > parse("foo.R")
> expression()
>
> Reason for the "invalid input": The bit pattern for raw[3:5], is:
>
> > R.utils::intToBin(raw[3:5])
> [1] "11101001" "01110100" "01110101"
>
> The first byte (raw[3]) matches the special UTF-8 byte pattern
> "1110xxxx", which according to UTF-8 should be followed by two more
> bytes with bit patterns "10xxxxxx" and "10xxxxxx"
> [http://en.wikipedia.org/wiki/UTF-8#Description].  Since raw[4:5] do
> not match those, it's an invalid UTF-8 byte sequence.  So, technically
> this does not happen for all comments using acute accents, but it's
> very likely.  More generally, a multi-byte UTF-8 sequence is expected
> when byte pattern "11xxxxxx" (>= 192 in decimal values) is encountered.
> Looking at http://en.wikipedia.org/wiki/ISO/IEC_8859, there are several
> characters with this bit pattern in many "Latin-N" encodings, which
> I'd assume are still in dominant use by many developers.
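
The byte-pattern argument above can be checked directly in R.  This is only a sketch: the helpers is_lead3() and is_cont() are hypothetical names introduced here, and validUTF8() assumes R >= 3.3.0, which is newer than the R 3.2 shown in the transcript.

```r
# Bytes of "# étudiant\n" in Latin-1, as in the foo.R example.
bytes <- as.integer(as.raw(c(0x23, 0x20, 0xe9, 0x74, 0x75, 0x64,
                             0x69, 0x61, 0x6e, 0x74, 0x0a)))

# Hypothetical helpers: classify a byte by its leading bits.
is_lead3 <- function(b) bitwAnd(b, 0xF0) == 0xE0   # 1110xxxx
is_cont  <- function(b) bitwAnd(b, 0xC0) == 0x80   # 10xxxxxx

# 0xe9 announces a three-byte sequence ...
lead_ok <- is_lead3(bytes[3])
# ... but 0x74 and 0x75 are plain ASCII, not continuation bytes.
cont_ok <- is_cont(bytes[4:5])

# The same conclusion from base R (assumes R >= 3.3.0).
latin1_line <- rawToChar(as.raw(bytes))
utf8_line   <- rawToChar(as.raw(c(0x23, 0x20, 0xc3, 0xa9, 0x0a)))  # "# é" in UTF-8
print(c(latin1 = validUTF8(latin1_line), utf8 = validUTF8(utf8_line)))
```

The second string shows that the same character is accepted once it is stored as the two-byte UTF-8 sequence 0xc3 0xa9, so the problem is the mismatch between the file's encoding and the assumed one, not the accent itself.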
>
> So, since options(encoding="UTF-8") was set at startup, that is also
> the encoding that R tries to follow.  My suggestion is that R should
> always use a pure-ASCII encoding when parsing R code in packages,
> because that is what 'Writing R Extensions' says we should use in the
> first place.
>
> /Henrik
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel