[R] read.spss and encodings

Peter Dalgaard P.Dalgaard at biostat.ku.dk
Thu Feb 1 14:18:30 CET 2007


Thomas Friedrichsmeier wrote:
> Hi!
>
> I'm having trouble with importing spss files containing non-ascii characters 
> (R 2.4.1, debian linux, i386). To reproduce:
>
> Download the following file: 
> http://statmath.wu-wien.ac.at/data/spss/de/comphomeneu.sav
>
> require (foreign)
> Sys.setlocale (locale="C")
> read.spss("comphomeneu.sav")$ARBEIT[1]
> # prints:
> # [1] im B\374ro
> # Levels: im B\374ro zuhause
>
> \374 of course is actually a u-umlaut. However, I guess in the C locale it's 
> not expected to print as such. But now try this (use any UTF-8 locale you may 
> have installed):
>
> Sys.setlocale (locale="de_DE.UTF-8")
> read.spss("comphomeneu.sav")$ARBEIT[1]
> # prints:
> # [1]Error in print.default(xx, quote = quote, ...) :
> #        invalid multibyte string
>
> To me it looks like read.spss () would probably need an encoding parameter, 
> and / or some iconv () magic. Now, locale conversion always makes my head 
> spin, so I thought I'd better post here before calling this a bug in 
> R. Two questions:
>
> 1) Is there some way to work around this, i.e. make sure it is converted to 
> proper UTF-8 while importing? Am I missing something obvious?
>   
> 2) Should I submit this as a bug report?
>   
1) Yes, 2) No

The problem is not really in read.spss, but in R itself. The short version is
that in released versions, we have

> "Im B\374ro"
[1]Error: invalid multibyte string

which is indeed a buglet, since it is not good if you cannot output what
you can input (notice that there is no problem until you try to print).
In r-devel, this has become

> "Im B\374ro"
[1] "Im B\xfcro"

so that invalid multibyte strings at least do not cause an error. However,
the real issue is that the string is in the wrong encoding for your locale,
so you should convert it:

> iconv("Im B\xfcro", from="latin1", to="UTF-8")
[1] "Im Büro"
> iconv("Im B\374ro",from="latin1", to="UTF-8")
[1] "Im Büro"
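
The same conversion can be applied to everything read.spss returns, which is one way to work around the problem at import time. What follows is only a minimal sketch, assuming the strings in the file are latin1 (as they appear to be here). to_utf8 is a hypothetical helper, not part of the foreign package, and the small list stands in for the result of read.spss("comphomeneu.sav") so that the example runs without the file:

```r
# Hypothetical helper (not part of the foreign package): re-encode
# character vectors and factor levels to UTF-8.
# Assumption: the source strings are latin1.
to_utf8 <- function(x, from = "latin1") {
  if (is.factor(x)) {
    # Only the level labels carry text; the integer codes are untouched.
    levels(x) <- iconv(levels(x), from = from, to = "UTF-8")
  } else if (is.character(x)) {
    x <- iconv(x, from = from, to = "UTF-8")
  }
  x
}

# Stand-in for read.spss("comphomeneu.sav"): a list with one factor
# whose levels contain the latin1 byte \xfc (u-umlaut).
dat <- list(
  ARBEIT = structure(c(1L, 2L), levels = c("im B\xfcro", "zuhause"),
                     class = "factor"),
  ID = c(1, 2)
)

# Assigning into fixed[] keeps the list structure and names while
# replacing each element with its converted version.
fixed <- dat
fixed[] <- lapply(dat, to_utf8)
fixed$ARBEIT[1]
```

With the real file, replace the stand-in by dat <- read.spss("comphomeneu.sav"). Numeric variables pass through unchanged, so the helper can be applied to every element without checking types first.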


-p

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907


