[Rd] encoding question again

Simon Urbanek simon.urbanek at r-project.org
Sat Dec 29 17:42:53 CET 2007


Oops, this was supposed to be a private reply ;) - sorry about the  
noise. The essence in English:
JGR uses all strings in UTF-8 encoding, but the system locale reports  
CP1252 which impedes automatic conversions (because R doesn't know  
that everything is UTF-8). Specific conversion via iconv works as  
expected (see the example below).
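
A minimal sketch of the explicit route, for anyone who wants to check it independently of the locale. The label, the temporary path and the variable names are illustrative assumptions, not taken from Matthias's code. The string is written with a \u escape so that it is UTF-8 no matter how the script is sourced, and the file is read back as raw bytes to see what actually reached the disk:

```r
# Hypothetical label; "\u00fc" is u-umlaut, so the literal is marked UTF-8
# regardless of the locale R is running in.
abt <- "n\u00fcr"

# Explicit conversion: UTF-8 bytes in, Latin-1 bytes out.
abt_latin1 <- iconv(abt, from = "UTF-8", to = "latin1")

# Write the converted bytes verbatim. useBytes = TRUE asks writeLines() not to
# re-encode the string on output; "wb" avoids newline translation on Windows.
path <- tempfile(fileext = ".html")  # stand-in for the real output file
con <- file(path, open = "wb")
writeLines(abt_latin1, con, useBytes = TRUE)
close(con)

# Read the file back as raw bytes to inspect what was written.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
print(bytes)
```

If the conversion worked, the umlaut should show up as the single byte fc rather than the pair c3 bc.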

Cheers,
Simon

On Dec 29, 2007, at 11:11 AM, Simon Urbanek wrote:

> Hello Matthias,
>
> On Dec 27, 2007, at 3:52 PM, Matthias Wendel wrote:
>
>> Hi Simon,
>> I followed your advice by adding/changing the lines
>>  abt = iconv(abt,"utf-8","latin1")
>>  zz = file( paste("Itemtabelle/Itemtabelle", abt, ".html"), "wt",
>> encoding = "latin1")
>> but this yielded the same results.
>
> I finally have a Windows machine for testing, and on my machine the
> file name is created correctly ...
>
> Still, the locale appears to be wrong: JGR always uses UTF-8, but the
> system reports CP1252. That seems to be why the automatic conversion
> (file(...,encoding..)) does not work. What always works, however, is
> explicit conversion:
>
> a=file("foo","wt")
> writeLines(iconv(..., "utf-8","latin1"),a)
> close(a)
>
> (FWIW: since the recommended encoding for web pages is UTF-8 anyway,
> you don't really need it ... ;))
>
> charToRaw is always a good test, because UTF-8 usually needs 2 bytes
> for umlauts and latin1 only one.
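
The byte-count test described above can be sketched as follows. The label and the variable names are assumptions made for illustration; the literal uses a \u escape so that it is UTF-8 in any locale:

```r
# Hypothetical label: "\u00c4" is A-umlaut, so this is "Aerzte" spelled with
# the umlaut and stored as UTF-8.
label_utf8 <- "\u00c4rzte"
label_latin1 <- iconv(label_utf8, from = "UTF-8", to = "latin1")

# The umlaut takes two bytes in UTF-8 and one byte in Latin-1.
raw_utf8 <- charToRaw(label_utf8)
raw_latin1 <- charToRaw(label_latin1)
print(raw_utf8)
print(raw_latin1)

# Same check without looking at the individual bytes.
n_utf8 <- nchar(label_utf8, type = "bytes")
n_latin1 <- nchar(label_latin1, type = "bytes")
print(c(utf8 = n_utf8, latin1 = n_latin1))
```

A label whose byte count is larger than its character count by one per umlaut is most likely UTF-8 rather than latin1.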
>
> Best regards,
> Simon
>
>
>> -----Original Message-----
>> From: Simon Urbanek [mailto:simon.urbanek at r-project.org]
>> Sent: Thursday, December 27, 2007 21:40
>> To: Matthias Wendel
>> Cc: r-devel at r-project.org
>> Subject: Re: [Rd] encoding question again
>>
>> Matthias,
>>
>> you get exactly what you specified - namely UTF-8. If you want your
>> html file to be latin1, then you have to say so:
>>
>> zz = file( paste("Itemtabelle/Itemtabelle", abt, ".html"), "wt",
>> encoding = "latin1")
>>
>> In addition, you're assuming that `abt' is in an encoding that your
>> OS understands. If it's not, you had better convert it to one that
>> is.
>> From your results it seems that `abt' is also UTF-8 encoded. Since
>> you didn't tell us where it comes from, you should either fix the
>> source or use something like iconv(abt,"utf-8","latin1"):
>>
>> (in UTF-8 locale)
>> > abt="nür"
>> > cat(abt,"\n")
>> nür
>> > charToRaw(abt)
>> [1] 6e c3 bc 72
>> > charToRaw(iconv(abt,"utf-8","latin1"))
>> [1] 6e fc 72
>>
>> Cheers,
>> Simon
>>
>>
>> On Dec 27, 2007, at 3:11 PM, Matthias Wendel wrote:
>>
>>> Hi, R Devils,
>>> I'm running the current R version in JGR (version 1.5-8).
>>> Sys.getlocale(category = "LC_ALL") yields
>>> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
>>>
>>> I want to write some HTML code, enriched with statistical results and
>>> labels encoded in Latin-1, which I pass to a function. One of the
>>> labels is used to generate the filename. Although the labels are
>>> handled correctly in JGR, they are somehow converted when they are
>>> written to the file. The filename is also not constructed as wanted.
>>> The function definition is sourced into R correctly. The function is
>>> defined like this:
>>>
>>> Itemtabelle.head <- function (abt ){
>>> # nür zöm TÄST
>>> zz = file( paste("Itemtabelle/Itemtabelle", abt, ".html"), "wt",
>>> encoding = "UTF-8")
>>> cat(as.character("<html xmlns:o=\"urn:schemas-microsoft-com:office:office\" xmlns:x=\"urn:schemas-microsoft-com:office:excel\" xmlns=\"http://www.w3.org/TR/REC-html40\">  \n"),
>>>     as.character("<head>\n "),
>>> 		.
>>> 		.
>>> 		.
>>>     as.character("        <td colspan=5 class=xl28 width=727 style=\'width:545pt\'>Gesundheitsindikatoren:  "), abt, as.character("</td>                                   \n"),
>>>     as.character("       </tr>\n"), file  = zz)
>>>     close(zz)
>>>     unlink(zz)
>>> }
>>> Setting abt as " Ärzte Innere, Gynäkologie" and calling the function
>>> with this argument, yields a filename "Itemtabelle  Ärzte Innere,
>>> Gynäkologie .html" and in the file a line
>>>       <td colspan=5 class=xl28 width=727 style='width:
>>> 545pt'>Gesundheitsindikatoren:    Ärzte Innere, Gynäkologie </
>>> td>
>>> is generated.
>>> I tried to solve this by using iconv, without success.
>>> The problem remains the same in the rgui and rterm - in rterm the
>>> resulting filename is "Itemtabelle Žrzte Innere,  
>>> Gyn„kologie  .html".
>>>
>>> Cheers,
>>> Matthias
>>>
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>
>>
>>
>


