[Rd] Windows iconv() "failure" in certain locales

Thu Jun 29 13:47:34 CEST 2017


On 29.06.2017 12:27, Martin Maechler wrote:
>>>>>> Uwe Ligges <ligges at statistik.tu-dortmund.de>
>>>>>>      on Wed, 28 Jun 2017 18:45:59 +0200 writes:
> 
>      > On 27.06.2017 17:36, Martin Maechler wrote:
>      >> This is a continuation of the R-devel thread with subject
>      >> "suggestion to fix packageDescription() for Windows users" :
>      >>
>      >> As I said there, a patch should address the underlying
>      >> problem in packageDescription() rather than be a kludgy
>      >> workaround patch for  citation().
>      >> (For that same reason, Ben Marwick proposed to fix
>      >> packageDescription() rather than the symptom seen in citation().)
>      >>
>      >> It's not hard to see that the problem is that  iconv() in
>      >> Windows does not always succeed in translating from "UTF-8" to
>      >> the "current locale", in the case mentioned there.
>      >>
>      >> I'm giving some easier reproducible examples:  no need to install
>      >> half of tidyverse just to get citation("readr") :
>      >>
>      >>> x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
>      >>> Encoding(x1) <- "latin1"
>      >>> xU <- iconv(x1, "latin1", "UTF-8")
>      >>
>      >>> Sys.setlocale("LC_CTYPE", "Chinese")
>      >> [1] "Chinese (Simplified)_People's Republic of China.936"
>      >>>
>      >>> iconv(x1, "latin1", "") # NA NA NA
>      >> [1] NA NA NA
>      >>> iconv(xU, "UTF-8", "") # NA NA NA
>      >> [1] NA NA NA
>      >>> iconv(xU, "UTF-8", "//TRANSLIT")
>      >> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
> 
>      > Interesting, I get Chinese characters here.
> 
> For which one of the above cases; can you show them
>   (it may survive E-mail servers; we had other
>    Chinese R strings on R-help / R-devel recently, right?)


x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
Sys.setlocale("LC_CTYPE", "Chinese")
# [1] "Chinese (Simplified)_People's Republic of China.936"
xU <- iconv(x1, "latin1", "UTF-8")
iconv(xU, "UTF-8", "//TRANSLIT")
# [1] "Ekstr鴐"         "J鰎eskog"        "bi遚hen Z黵cher"
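
In each string above, one Chinese character stands where the accented letter and the letter after it used to be. That pattern would fit the result still holding the latin1 bytes, which are then displayed in code page 936 as double-byte characters; this is an assumption about the display, not something verified here. A minimal sketch to list the byte pairs in question (`suspect_pairs` is a hypothetical helper name):

```r
x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"

# Hypothetical helper: for every byte >= 0x81 that is not the last byte,
# return it together with the byte that follows, as hex.  Under the
# assumption above, each such pair is what shows up as one character
# in a double-byte code page.
suspect_pairs <- function(s) {
    b <- as.integer(charToRaw(s))
    i <- which(b >= 0x81 & seq_along(b) < length(b))
    vapply(i, function(k) paste(sprintf("%02x", b[k + 0:1]), collapse = " "),
           character(1))
}

suspect_pairs(x1[1])
# [1] "f8 6d"
```

Here `f8` is the latin1 byte for "ø" and `6d` is "m", i.e. exactly the two letters that are missing from the first string in the output above.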


 > sessionInfo()
R Under development (unstable) (2017-06-28 r72861)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.0


Best,
Uwe


> In any case, I think  that is even worse, isn't it?
> As also in a Chinese locale you'd want to see explicit-latin1 text
> in something that looks like latin-1 (I know from a master's
>   student that Windows+Chinese can well show latin-1-like
>   letters also interspersed in the Chinese text),
> no ?
> 
> 
>      > Beside the comments from Duncan Murdoch:
> 
>      > iconv(x1, "latin1", "", sub="?")
>      > etc. would be an alternative in case some characters really cannot be
>      > converted into the target encoding and should perhaps be considered for
>      > the time after Duncan commits the fix for the underlying problem.
> 
> Yes. I'd had the same idea; that's why I used it in the code I
> sent along.
> 
> So,
> 
> 1)  we definitely won't commit the workaround patch for citation().
> 
> 2) I have a "workaround patch" for packageDescription() which is
>     more useful in the sense that only if iconv() produces NA's, it
>     tries alternatives, notably   "//TRANSLIT",  "ASCII//TRANSLIT"
>     (the latter Ben also mentioned, but my patch would only use it
>      in the NA case) and also the same  'sub="?"' that you mention
>      above, Uwe.
> 
>     That patch is not Windows-specific and will automatically
>     also help in other cases / platforms where the iconv()
>     re-encoding leads to partial NAs.
>     
>    @Duncan M: would you _not_ want me to commit that either?
> 
> Martin
>
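
The fallback order described in point 2) can be sketched roughly as follows. This is not the actual patch: `iconv_fallback` and `try_iconv` are hypothetical names, and wrapping the calls in tryCatch() is an assumption, made so that a platform whose iconv rejects a "//TRANSLIT" target yields NA instead of stopping with an error.

```r
# Hypothetical wrapper: an unsupported conversion gives NAs, not an error
try_iconv <- function(x, from, to, ...)
    tryCatch(iconv(x, from, to, ...),
             error = function(e) rep(NA_character_, length(x)))

# Hypothetical helper: plain re-encoding first; only the elements that
# came back as NA are retried with progressively lossier alternatives.
iconv_fallback <- function(x, from = "UTF-8", to = "") {
    res <- try_iconv(x, from, to)
    # Alternatives mentioned in the thread, tried in this order
    for (alt in c(paste0(to, "//TRANSLIT"), "ASCII//TRANSLIT")) {
        bad <- is.na(res) & !is.na(x)
        if (!any(bad)) return(res)
        res[bad] <- try_iconv(x[bad], from, alt)
    }
    # Last resort: replace each non-convertible byte by "?"
    bad <- is.na(res) & !is.na(x)
    if (any(bad)) res[bad] <- try_iconv(x[bad], from, to, sub = "?")
    res
}

iconv_fallback(c("abc", "J\u00f6reskog"))
```

Elements that convert cleanly are never touched by the lossy steps, so in a locale where the plain conversion works the result is the same as that of a single iconv() call.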