[Rd] latin1,utf-8...encoding and data

Martin Maechler maechler at stat.math.ethz.ch
Thu Oct 19 15:26:55 CEST 2006


>>>>> "Stéphane" == Stéphane Dray <dray at biomserv.univ-lyon1.fr>
>>>>>     on Thu, 19 Oct 2006 09:46:49 +0200 writes:

    Stéphane> Thanks a lot for this clear answer. So there is no way to preserve our 
    Stéphane> french cultural exception (accented characters), 

I agree that there are many French cultural exceptions ;-) 
--- and as a Swiss, I hold several of them in high esteem ---
however "accented" characters (with the appropriate meaning of "accented")
are not at all a French exception, rather almost a continental
European one {as long as we are staying in the "latin" alphabet
context}.  If I think of what I know of Europe, the only
countries/languages *not* using some version of "accented"
characters are the British and (I think) the Dutch/Flemish.
Everyone else (? probably I forgot some, and don't know about others
like Gaelic,...)  has some kind of accents...

I agree with Stéphane that this is unfortunate for quite a few
of us, and it came as a big surprise to me when I first heard
about this from Brian.  .. aah, life was easy when we western
chauvinists could behave as if the whole relevant part of the
world was happy with iso-latin1...

Martin 


    Stéphane> if we want to be international... I had thought
    Stéphane> that adding an 'encoding' parameter to the data
    Stéphane> function (e.g. data(mydata,encoding="latin1")),
    Stéphane> as in the function 'file', could be a way to
    Stéphane> solve the problem. Apparently, the problem is much
    Stéphane> more complicated...

    Stéphane> Sincerely.


    Stéphane> Prof Brian Ripley wrote:

    >> Only ASCII letters are portable: those accented characters do not even 
    >> exist in many of the encodings used for R, e.g. Russian and Japanese 
    >> on Windows machines.
    >> 
    >> There is no way to associate an encoding with a character string in 
    >> R.  We considered it, but it would have had severe back-compatibility 
    >> problems and little advantage (you cannot display non-ASCII character 
    >> strings portably: even if you have a Unicode encoding you still need 
    >> to select a suitable font).
    >> 
    >> 'B. Ripley' (sic)
    >> 
    >> 
    >> On Wed, 18 Oct 2006, Stéphane Dray wrote:
    >> 
    >>> Hello,
    >>> I have some questions concerning encoding and package distribution. We
    >>> develop the ade4 package. Some data sets included in the package
    >>> contain accented characters (e.g. é,è...). The data sets have been
    >>> saved using latin1 encoding, but some of us use utf-8 and cannot display
    >>> the data sets that contain accented characters.
    >>> e.g:
    >>> 
    >>> library(ade4)
    >>> data(rankrock)
    >>> rankrock
    >>> 
    >>> In this case, the characters are in the rownames. Other data sets have such
    >>> characters in the data (e.g. levels of factors..). A solution is to use
    >>> iconv... this is quite easy for us but perhaps more difficult for a user
    >>> who may have no idea of the problem. This problem is quite marginal
    >>> for the moment but some Linux distributions are utf-8 by default (e.g.
    >>> Ubuntu) and I suppose that the problem will be more and more present in
    >>> the future.
    >>> 
    >>> So we wonder if there is a proper way to code and save these data sets.
    >>> I have found some documents of B. Ripley and this note :
    >>> 
    >>> http://developer.r-project.org/210update.txt
    >>> 
    >>> -  Names in data objects (e.g. in .rda files) are problematic.  It
    >>> is likely that by release time these will be treated as in
    >>> Latin-1.
    >>> 
    >>> Unless I am mistaken, I did not find an answer to this problem there.
    >>> 
    >>> What are the plans of R gurus on this question ?
    >>> Thanks a lot.
    >>> Sincerely.
    >>> 
    >>> Please add my address in answers as I am not a subscriber of this list.
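
The iconv workaround mentioned in the original message can be sketched
roughly as below. This is only an illustration, not code from ade4: the
two example strings and the variable names are made up, and it assumes
the stored strings really are Latin-1. The bytes are built explicitly so
that the example does not depend on the encoding of the file it is
pasted into.

```r
# Hypothetical example strings, built from raw Latin-1 bytes:
# 0xE9 is "é" and 0xE8 is "è" in Latin-1.
latin1_names <- c(
  rawToChar(as.raw(c(0x70, 0x72, 0xe9))),        # "pré"  in Latin-1
  rawToChar(as.raw(c(0x70, 0xe8, 0x72, 0x65)))   # "père" in Latin-1
)

# Re-encode from Latin-1 to UTF-8. iconv() returns NA for any element
# it cannot convert, so that is worth checking before using the result.
utf8_names <- iconv(latin1_names, from = "latin1", to = "UTF-8")

if (anyNA(utf8_names)) {
  warning("some strings could not be converted from Latin-1")
}

print(utf8_names)
```

For a data set, the same call would presumably be applied to whatever
holds the strings, for instance rownames(x) <- iconv(rownames(x),
"latin1", "UTF-8") for row names, or to levels(f) for a factor.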



