[Rd] Problem with UTF-8 text in the Rcmdr package

John Fox jfox at mcmaster.ca
Mon Sep 8 14:48:05 CEST 2008


Dear Brian,

> -----Original Message-----
> From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org]
> On Behalf Of Prof Brian Ripley
> Sent: September-08-08 5:46 AM
> To: John Fox
> Cc: 'Jaro.Lajovic'; 'R-devel'
> Subject: Re: [Rd] Problem with UTF-8 text in the Rcmdr package
> 
> Unless Windows is running in CP1250 (the Slovenian encoding on Windows),
> this is not expected to work.  I believe John tested in CP1252, and it just
> so happens that those characters are in the same place in CP1250 and CP1252.

Yes, that's right: My locale is English (Canada), which uses CP1252.
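
For anyone who wants to check the byte positions directly, here is a minimal sketch. It assumes the iconv build behind R knows the encoding names "CP1250", "CP1252" and "ISO-8859-2" (Latin-2); the helper byte_in() and the data frame are my own illustration, not anything taken from R or the Rcmdr.

```r
# S and z with caron, upper and lower case, written as Unicode escapes
# so that the script itself is pure ASCII.
chars <- c("\u0160", "\u017D", "\u0161", "\u017E")

# Hypothetical helper: the single byte that encodes each character in the
# target encoding (assumes every character is representable there).
byte_in <- function(x, to) {
  vapply(iconv(x, from = "UTF-8", to = to, toRaw = TRUE),
         function(b) as.character(b), character(1))
}

positions <- data.frame(
  codepoint = sprintf("U+%04X",
                      vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE)),
  CP1250 = byte_in(chars, "CP1250"),
  CP1252 = byte_in(chars, "CP1252"),
  Latin2 = byte_in(chars, "ISO-8859-2"),
  stringsAsFactors = FALSE
)
print(positions)
```

The CP1250 and CP1252 columns should agree (8a, 8e, 9a, 9e), while the Latin-2 column differs (a9, ae, b9, be). C with caron is left out because CP1252 does not contain it.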

> 
> I get something different in CP1250, as pasting into the script window also
> does not work.  But if I use the Unicode escapes, the result is rendered
> correctly in the output window.
> 
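
By "Unicode escapes" I take Brian to mean something along the following lines. This is only a sketch of the idea; the variable name x is arbitrary, and the bytes printed by the last line depend on the locale R is running in.

```r
# The six test characters written with \u escapes instead of being pasted in.
x <- "\u010c\u0160\u017d\u010d\u0161\u017e"

Encoding(x)    # R marks strings built from \u escapes as UTF-8
utf8ToInt(x)   # the Unicode code points, independent of the locale

# Bytes after conversion to the native encoding; locale dependent.
charToRaw(enc2native(x))
```

Because the string carries an explicit UTF-8 mark, it does not rely on how the script file or the pasted text happened to be encoded.
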
> I think Jaro has put his finger on this: Tcl/Tk output thinks it is in
> Latin-2 and not CP1250, and s and z caron have different positions in those
> two character sets.  Here is something I can reproduce easily: with XP set
> to Slovenian:
> 
> > x <-"ČŠŽčšž"
> > x
> [1] "ČŠŽčšž"
> > charToRaw(x)
> [1] c8 8a 8e e8 9a 9e
> 
> which is correct for CP1250.  Now if I submit 'x' in the Rcmdr script
> window, I get the wrong output in the output window.
> 
> And I've tracked that down to a bug in iconv (something we take from
> libiconv on Windows): it does think the native encoding is Latin-2, not
> CP1250.  I'll put a workaround in R-devel and R-patched shortly.  That has
> other potential ramifications that will take me longer to investigate, and
> the correct thing may be to fix iconv.

Thank you very much for tracking this down. 
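
If I understand the diagnosis, the effect can be imitated on any platform by decoding the CP1250 bytes you show above under each of the two assumptions. This is a sketch under my own assumptions (that iconv accepts "CP1250" and "ISO-8859-2" as names and converts all six bytes), not a reproduction of what libiconv does internally on Windows.

```r
# The bytes reported by charToRaw(x) in the CP1250 session quoted above.
cp1250_bytes <- as.raw(c(0xc8, 0x8a, 0x8e, 0xe8, 0x9a, 0x9e))
x <- rawToChar(cp1250_bytes)

# Decode the same bytes as CP1250 (correct) and as Latin-2 (the mistake).
right <- iconv(x, from = "CP1250",     to = "UTF-8")
wrong <- iconv(x, from = "ISO-8859-2", to = "UTF-8")

# Compare code points position by position.
same <- utf8ToInt(right) == utf8ToInt(wrong)
print(utf8ToInt(right))
print(same)
```

Only the positions holding c with caron should come out the same, which would be consistent with Jaro's observation that c with hacek is unaffected while s and z with hacek are not.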

Recall that there is also apparently a problem under Mac OS X, where the
characters appeared incorrectly in both the Script and Output windows.

Regards,
 John

> 
> On Sun, 7 Sep 2008, John Fox wrote:
> 
> > Dear Brian,
> >
> > Thank you for addressing the problem -- I was hoping that you would.
> >
> >> -----Original Message-----
> >> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> >> Sent: September-07-08 7:23 AM
> >> To: John Fox
> >> Cc: 'R-devel'; 'Jaro.Lajovic'
> >> Subject: Re: [Rd] Problem with UTF-8 text in the Rcmdr package
> >>
> >> The issue appears to be the Rcmdr output window and menus.  They are
> >> done using Tcl/Tk, not by R.  So this might be a problem in Tcl/Tk or
> >> the fonts it uses, or it might be a problem with what Rcmdr passes to
> >> the tcltk package.
> >>
> >> We need the means to reproduce this (as per the posting guide):
> >
> > Jaro provides an example in one of his messages in my posting (though
> > it is slightly in error): If one enters
> >
> > cat("ČŠŽčšž\n")
> >
> > in the Rcmdr Script window, the characters are rendered correctly.
> > Executing this command (via the Submit button) produces the following
> > in the Output window:
> >
> >> cat("??????\n")
> > ??????
> >
> > which actually appears as
> >
> >> cat("??\n")
> > ??
> >
> > This is under Windows Vista / R 2.7.2 / Rcmdr 1.4-0.
> >
> >>
> >> - what OSes are affected?  Does this occur in a UTF-8 locale on
> >> Linux, for example?
> >
> > I've now checked under Mac OS X and Linux Ubuntu, with the following
> > results:
> >
> > Under Mac OS X 10.5.4 / R 2.7.2 / Rcmdr 1.4-0 / Tcl/Tk 8.4
> >
> > cat("ČŠŽčšž\n") appears as cat("?????\n") in *both* the Script window
> > and the Output window.
> >
> > Under Ubuntu Linux 8.04 / R 2.7.0 / Rcmdr 1.4-0/ Tcl/Tk 8.5
> >
> > cat("ČŠŽčšž\n") appears *correctly* in *both* the Script window and
> > the Output window.
> >
> >>
> >> - in what locales?
> >
> > I'm afraid that I don't know how to check this short of changing the
> > locale for my Windows machine. I do observe the problem in Windows
> > when I start Rgui with language=sl.
> >
> >>
> >> - what versions of Tcl/Tk?  Note that the version shipped with Windows R
> >> changed between 2.5.1 and 2.7.x.
> >
> > Yes, and please see above, but if the problem were with Tcl/Tk, why
> > does this work in the Script window under Windows and in both Script
> > and Output under Ubuntu?
> >
> >>
> >> - Is this anything to do with translations?  I've not looked at how
> >> translations are done in Rcmdr, but if gettext() is used, the string
> >> passed to R for output is in the native encoding, so 'UTF-8
> >> characters' is incorrect.  It is possible that it is an iconv problem
> >> if the translations are supplied in UTF-8 and not Latin-2.
> >
> > Yes, the Rcmdr package uses gettext(). Could Jaro avoid the problem by
> > using Latin-2 in preference to UTF-8?
> >
> >>
> >> There are far too many layers involved here to guess at what is going
> >> on.  My guess is that it ought to be possible to give a simple example
> >> of a string which can be output to the Rcmdr console and will be
> >> rendered incorrectly (together with a screen shot of how it is
> >> rendered).
> >
> > Indeed, please see above. I've also attached a screenshot under
> > Windows, having started R with language=sl.
> >
> >>
> >> I think the characters referred to are the Unicode glyphs 's and z
> >> with caron', \u0161 and \u017E.  It seems that these will only be
> >> displayable in Rcmdr on Windows in a Latin-2 locale, which I do not
> >> have set up on Windows (but believe I could get installed).  However,
> >> examples using that (and the menus) seem to be correct in both
> >> sl_SI.iso88592 and sl_SI.utf8 on Linux, which suggests that this is
> >> probably not an R issue but a Tcl/Tk one.
> >
> > I'm above my depth with respect to these issues, but I do find it
> > curious that under Windows the characters appear correctly in the
> > Script window but not the Output window.
> >
> >>
> >> On Fri, 5 Sep 2008, John Fox wrote:
> >>
> >>> Dear list members,
> >>>
> >>> I've attached some email correspondence with Jaro Lajovic (with his
> >>> permission), detailing a problem with the Slovenian translation file
> >>> for the Rcmdr package.
> >>
> >> Unfortunately, it is not 'detailed', and we do need the details.
> >
> > I hope that the additional information in this message will supply at
> > least some of the necessary details.
> >
> > Thank you for your help,
> > John
> >
> >>
> >>> In brief, while certain UTF-8 characters used in Slovenian used to
> >>> appear properly in older versions of R, some characters do not
> >>> display properly in the Rcmdr menus and output window under R 2.7.x.
> >>> I've confirmed the problem with the current version of the Rcmdr
> >>> package (1.4-0) and R 2.7.2 under Windows Vista.
> >>>
> >>> I've checked the R docs and NEWS file for changes to R, but wasn't
> >>> able to turn up anything that seemed relevant. Frankly, however, my
> >>> understanding of how various character sets are handled is only
> >>> partial.
> >>>
> >>> Any help would be appreciated.
> >>>
> >>> John
> >>>
> >>> ------------------------------
> >>> John Fox, Professor
> >>> Department of Sociology
> >>> McMaster University
> >>> Hamilton, Ontario, Canada
> >>> web: socserv.mcmaster.ca/jfox
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Jaro.Lajovic [mailto:Jaro.Lajovic at mf.uni-lj.si]
> >>> Sent: August-26-08 2:57 AM
> >>> To: John Fox
> >>> Subject: Re: Slovenian Rcmdr .po and .mo - and a problem
> >>>
> >>> Dear John,
> >>>
> >>>> That seems to imply that there's a change in R rather than in the
> >>>> Rcmdr that produced this problem. Do you notice the problem with
> >>>> any other packages that use translation or with R itself?
> >>>
> >>> As for other translated R packages, I am afraid I am not aware of any.
> >>> However, a quick test using cat with special characters:
> >>> cat "ČŠŽčšž\n"
> >>> reveals that the string prints OK in the R (2.7.1.) console. The
> >>> command line also shows OK in the Rcmdr Script window, but does not
> >>> display right in the Output window. Special chars also fail in the
> >>> Messages window.
> >>>
> >>> Input (Script window) thus seems not to be affected, while the menu
> >>> system and output do not work properly.
> >>>
> >>> Thank you very much,
> >>> Jaro
> >>>
> >>>
> >>>> On Mon, 25 Aug 2008 21:54:43 +0200
> >>>>  "Jaro.Lajovic" <Jaro.Lajovic at mf.uni-lj.si> wrote:
> >>>>> Dear John,
> >>>>>
> >>>>>> One question though: I assume from your message that the previous
> >>>>>> version of the Rcmdr worked OK with R 2.7.1. Is that right?
> >>>>> No, the version 1.3-5 (that I still have with R 2.5.1) does not
> >>>>> work with R 2.7.1 either. So:
> >>>>>
> >>>>> Rcmdr 1.3-5 with R 2.5.1: works OK.
> >>>>> Rcmdr 1.3-5 with R 2.7.1: does not work properly.
> >>>>> Rcmdr 1.4-0 with R 2.7.1: does not work properly.
> >>>>>
> >>>>> Thank you in advance,
> >>>>> Jaro
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Mon, 25 Aug 2008 18:52:32 +0200  "Jaro.Lajovic"
> >>>>>> <Jaro.Lajovic at mf.uni-lj.si> wrote:
> >>>>>>> Dear John,
> >>>>>>>
> >>>>>>> Please find attached zipped Slovenian versions of .po (plain
> >>>>>>> text and UTF-8 coded text) and .mo files.
> >>>>>>>
> >>>>>>> However, there seems to be a problem I have not been able to
> >>>>>>> resolve. While special characters display properly under R
> >>>>>>> version 2.5.1 with Rcmdr 1.3-5, they fail to display (= are
> >>>>>>> substituted by black blocks) under R version 2.7.1 with the new
> >>>>>>> Rcmdr 1.4-0. By the way: the .mo file of the ver. 1.3-5 copied
> >>>>>>> to 1.4-0 also failed to display properly.
> >>>>>>>
> >>>>>>> (An additional detail: three special characters that are used in
> >>>>>>> the Slo version are c, s and z with hacek. c with hacek is not
> >>>>>>> affected, it is just s and z with hacek that are not displayed OK.)
> >>>>>>>
> >>>>>>> Your advice will be much appreciated.
> >>>>>>>
> >>>>>>> With best regards,
> >>>>>>> Jaro
> >>>>
> >>>> --------------------------------
> >>>> John Fox, Professor
> >>>> Department of Sociology
> >>>> McMaster University
> >>>> Hamilton, Ontario, Canada
> >>>> http://socserv.mcmaster.ca/jfox/
> >>>>
> >>>
> >>> ______________________________________________
> >>> R-devel at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>
> >>
> >> --
> >> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> >> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> >> University of Oxford,             Tel:  +44 1865 272861 (self)
> >> 1 South Parks Road,                     +44 1865 272866 (PA)
> >> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> >
> 
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595


