[Rd] locales and readLines

Martin Morgan mtmorgan at fhcrc.org
Mon Sep 3 21:05:26 CEST 2007


Thank you very much for explaining this. I had indeed overlooked the
use of encoding in 'file'. I also appreciate how unsatisfactory
guessing at the encoding can be, and that scanning the entire file is
not appropriate for large files or general connections.

Sorry that 'burden' came across as negative; I meant it more along the
lines of 'the burden of responsibility for handling the inputs the
package developer implies they'll handle'. Much better than the burden
of saying 'sorry, no can do'.

Thanks again,

Martin

Prof Brian Ripley <ripley at stats.ox.ac.uk> writes:

> I think you need to delimit a bit more what you want to do.  It is
> difficult in general to tell what encoding a text file is in, and very
> much harder if this is a data file containing only a small proportion
> of non-ASCII text, which might not even be words in a human language
> (but abbreviations or acronyms).
>
> If you have experience with systems that do try to guess (e.g. Unix
> 'file') you will know that they are pretty fallible.  There are Perl
> modules available, for example: I checked Encode::Guess which says
>
>     ·   Because of the algorithm used, the ISO-8859 series and other
>         single-byte encodings do not work well unless one of the
>         ISO-8859 encodings is the only suspect (besides ascii and utf8).
>
>     ·   Do not mix national standard encodings and the corresponding
>         vendor encodings.
>
>     It is, after all, just a guess.  You should always be explicit when
>     it comes to encodings.  But there are some environments, especially
>     Japanese, where guess-coding is a must.  Use this module with care.
>
>
> I think you may have missed that the main way to specify an encoding
> for a file is
>
> readLines(file("fn", encoding="latin2"))
>
> and not the encoding arg to readLines (although the help page is
> quite clear that the latter does not re-encode).  The latter only
> allows UTF-8 and latin1.
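>
> A minimal sketch of the distinction (the file name is made up):
>
>     ## re-encodes from latin2 to the native encoding as it reads
>     con <- file("fn.txt", encoding = "latin2")
>     x <- readLines(con)
>     close(con)
>
>     ## does NOT re-encode: it merely declares the strings to be latin1
>     y <- readLines("fn.txt", encoding = "latin1")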
>
> The author of a package that offers facilities to read non-ASCII text
> does need to offer the user a way to specify the encoding.  I think
> suggesting that is 'an extra burden' is exceedingly negative: you
> could rather be thankful that R provides the facilities these days to
> do so.  And if the package or its examples contains non-ASCII
> character strings, it is de rigueur for the author to consider how it
> might work on other people's systems.
>
> Notice that source() already has some of the 'smarts' you are asking
> about if 'file' is a file and not a connection, and you could provide
> a similar wrapper for readLines.  That is useful either when the user
> can specify a small set of possible encodings or when such a set can
> be deduced from the locale.  If the concern is that the file might be
> UTF-8 or latin1, such a guess is often reliable (latin1 files can be
> valid UTF-8, but rarely are).  However, if you have Russian text
> which might be in one of the several 8-bit encodings, the only way I
> know to decide which is to see if the strings make sense (and if they
> are acronyms, they may make sense in all the possible encodings).
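>
> A sketch of such a wrapper, assuming the caller supplies the
> candidate encodings (the function name is made up, and validating via
> iconv() is just one approach, not what source() does):
>
>     readLinesGuess <- function(path, encodings = c("UTF-8", "latin1")) {
>         lines <- readLines(path)  # read the bytes without re-encoding
>         for (enc in encodings) {
>             ## iconv() returns NA for strings invalid in 'enc'
>             out <- iconv(lines, from = enc, to = "UTF-8")
>             if (!any(is.na(out))) return(out)
>         }
>         stop("no candidate encoding converted cleanly: ",
>              paste(encodings, collapse = ", "))
>     }
>
> Since every byte sequence is valid latin1, latin1 only makes sense as
> the last candidate: hence UTF-8 is tried first.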
>
> BTW, to guess an encoding you really need to process all the input,
> so this is not appropriate for general connections, and for large
> files it might be better done externally to R, e.g. via Perl.
>
> I would say minimal good practice would be to
>
> - allow the user to specify the encoding of text files (as in the
>    sketch below).
> - ensure you have specified the encoding of all non-ASCII data in your
>    package (which includes documentation, for example).
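>
> In code, the first point can be as little as threading an encoding
> argument through to the connection (readMyData is a made-up name; the
> empty-string default means the native encoding):
>
>     readMyData <- function(path, encoding = "") {
>         con <- file(path, encoding = encoding)
>         on.exit(close(con))  # close the connection when done
>         readLines(con)
>     }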
>
> I'd leave guessing to others: as
> http://www.cs.tut.fi/~jkorpela/chars.html says,
>
>    It is hopefully obvious from the preceding discussion that a sequence of
>    octets can be interpreted in a multitude of ways when processed as
>    character data. By looking at the octet sequence only, you cannot even
>    know whether each octet presents one character or just part of a
>    two-octet presentation of a character, or something more complicated.
>    Sometimes one can guess the encoding, but data processing and transfer
>    shouldn't be guesswork.
>
>
>
> On Fri, 31 Aug 2007, Martin Morgan wrote:
>
>> R-developers,
>>
>> I'm looking for some 'best practices', or perhaps an upstream solution
>> (I have a sense of déjà vu about this, so sorry if it's already been
>> asked).  Problems occur when a file is encoded as latin1 but the user
>> has a UTF-8 locale (or, more generally, when the input encoding does
>> not match R's locale).  Here are two examples from the Bioconductor
>> help list:
>>
>> https://stat.ethz.ch/pipermail/bioconductor/2007-August/018947.html
>>
>> (the relevant command is library(GEOquery); gse <- getGEO('GSE94'))
>>
>> https://stat.ethz.ch/pipermail/bioconductor/2007-July/018204.html
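>>
>> A small reproduction of the failure mode (the file name is made up):
>>
>>     ## the latin1 bytes for 'café' plus newline; 0xE9 is not valid UTF-8
>>     writeBin(as.raw(c(0x63, 0x61, 0x66, 0xE9, 0x0A)), "cafe.txt")
>>     readLines("cafe.txt")             # bytes are not valid UTF-8
>>     readLines(file("cafe.txt", encoding = "latin1"))  # 'café'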
>>
>> I think the solutions are:
>>
>> 1. Specify the encoding in readLines.
>>
>> 2. Convert the input using iconv (sketched below).
>>
>> 3. Tell the user to set their locale to match the input file (!)
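>>
>> For option 2, something like (continuing the made-up file above):
>>
>>     x <- readLines("cafe.txt")                    # bytes read as-is
>>     x <- iconv(x, from = "latin1", to = "UTF-8")  # now valid UTF-8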
>>
>> Unfortunately, these (1 & 2, anyway) place an extra burden on the
>> package author, who has to learn about locales, about the encoding
>> conventions of the files they read, and about how R deals with
>> encodings.
>>
>> Are there other / better solutions? Any chance for some (additional)
>> 'smarts' when reading files?
>>
>> Martin
>>
>
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-- 
Martin Morgan
Bioconductor / Computational Biology
http://bioconductor.org


