[R] read.spss and umlaut

Thomas Kuster r at fam-kuster.ch
Thu Aug 3 10:38:13 CEST 2006


Hello

On Wednesday, 2 August 2006 at 17:11, Thomas Lumley wrote:
> This sounds like a conflict between encodings -- e.g. if R is assuming UTF-8
> and the file is encoded in Latin-1 then the sequence
> U+00FC : LATIN SMALL LETTER U WITH DIAERESIS
> U+0072 : LATIN SMALL LETTER R
> is coded as FC72 in the file, which is an illegal byte sequence in UTF-8.

Hex:  74 65 20 66 fc 72 20 61 6c 6c 65 53 45 2f 31 36
Text:  t  e     f  ü  r     a  l  l  e  S  E  /  1  6
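
To make the byte-level picture concrete, here is a minimal shell sketch, assuming a glibc-style `iconv` and `od` are available. It rebuilds the three bytes 66 fc 72 from the dump above, checks whether they pass as UTF-8, and then converts them from ISO-8859-1 to UTF-8. The function and variable names are only for illustration.

```shell
# The bytes 66 fc 72 ("für" in Latin-1), written with an octal escape for 0xFC
sample() { printf 'f\374r'; }

# Validation: iconv exits non-zero when the input is not valid UTF-8
if sample | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
  valid_utf8=yes
else
  valid_utf8=no
fi
echo "valid as UTF-8: $valid_utf8"

# Conversion: read the bytes as ISO-8859-1 and re-encode them as UTF-8
converted_hex=$(sample | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "UTF-8 bytes: $converted_hex"
```

After conversion the single byte fc becomes the two bytes c3 bc, which is how UTF-8 encodes U+00FC.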

> The underlying C code (being written in the US quite a long time ago)
> doesn't know about encodings, and I don't know what the rules are in SPSS
> for valid characters (I suspect that in these old portable file formats it
> probably just reads and writes bytes, leaving it up to the OS to interpret
> them).

But why does the C code stop reading? Isn't "/" the end mark of the string? 
What would be the problem if I changed that in the source?
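
Looking at the sample line from the file (quoted further down), "/" seems to end a number rather than a string: `SD/` and `N/` come directly before `EV Postdienste für alle`. Assuming these fields are base-30 integers with digits 0-9 and A-T, which is only a reading of that one line and not a statement of the format specification, a small sketch can decode them. The helper `b30` is hypothetical.

```shell
# Decode a base-30 integer (digits 0-9, A-T); hypothetical helper for illustration
b30() {
  awk -v s="$1" 'BEGIN {
    digits = "0123456789ABCDEFGHIJKLMNOPQRST"
    n = 0
    for (i = 1; i <= length(s); i++)
      n = n * 30 + index(digits, substr(s, i, 1)) - 1
    print n
  }'
}

value=$(b30 SD)       # field two positions before the label text
label_len=$(b30 N)    # field directly before the label text

# Byte count of the label, with \374 standing for the Latin-1 byte 0xFC
label_bytes=$(printf 'EV Postdienste f\374r alle' | wc -c | tr -d ' ')

echo "value=$value label_len=$label_len label_bytes=$label_bytes"
```

Under these assumptions `SD` decodes to 853, a value that also appears in the quoted R output, and `N` decodes to 23, the byte length of the label. That would mean the label is stored with an explicit length and "/" is not its end mark.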

> You could try running R in a non-UTF-8 locale to see if it helps.

I think my locale is non-UTF-8 (de_CH, isolatin). How can I check that, and 
set another one temporarily?
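
For checking and temporarily changing the locale, here is a minimal sketch from the shell side, assuming a system that provides the `locale` utility. Inside R, `Sys.getlocale()` reports the current setting and `Sys.setlocale()` changes it for the running session. The locale name `de_CH.ISO-8859-1` in the last comment is an assumption; the exact spelling depends on what `locale -a` lists on the machine.

```shell
# Show the character encoding of the current locale
current_charmap=$(locale charmap 2>/dev/null || echo unknown)
echo "current charmap: $current_charmap"

# List installed Swiss German locales, if any
locale -a 2>/dev/null | grep -i '^de_CH' || echo "no de_CH locale found"

# A variable assignment placed before a command applies to that command only
scoped=$(LC_ALL=C sh -c 'printf %s "$LC_ALL"')
echo "inside the command: $scoped, in this shell: ${LC_ALL:-unset}"

# Assumed usage for a single R session (locale name is an assumption):
# LC_ALL=de_CH.ISO-8859-1 R --vanilla
```

Setting the variable as a prefix leaves the login environment untouched, so the change lasts only for that one R session.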

A dirty hack like this:
sed -e 's/ä/ae/g' -e 's/ö/oe/g' -e 's/ü/ue/g' \
    -e 's/Ä/Ae/g' -e 's/Ö/Oe/g' -e 's/Ü/Ue/g' \
    projets.por > projets_non_umlaut.por
didn't work (file 'projets_non_umlaut.por' is not in any supported SPSS 
format).
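
One possible reason the sed hack fails: it turns one byte into two, so any stored string lengths or fixed line widths in the file would no longer match (an assumption about the format, not something confirmed here). In addition, if the shell runs in UTF-8, a typed `ü` is two bytes and would not match the single byte fc in the file. A length-preserving alternative is sketched below with `tr` on raw bytes. The output file name is hypothetical, and whether read.spss accepts the result is untested.

```shell
# Map each Latin-1 umlaut byte to one ASCII letter so every string keeps its length.
# Octal escapes: \344 ä, \366 ö, \374 ü, \304 Ä, \326 Ö, \334 Ü
strip_umlauts() { LC_ALL=C tr '\344\366\374\304\326\334' 'aouAOU'; }

# Demonstration on the bytes of "für alle" (0xFC written as \374)
before=$(printf 'f\374r alle' | wc -c | tr -d ' ')
result=$(printf 'f\374r alle' | strip_umlauts)
after=$(printf '%s' "$result" | wc -c | tr -d ' ')
echo "$result ($before bytes before, $after bytes after)"

# Assumed usage on the real file (output name is hypothetical):
# strip_umlauts < projets.por > projets_ascii.por
```

The labels lose their umlauts, but the byte count stays the same, which is the property the two-letter replacement breaks.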

Thomas

> If anyone has definitive information about how SPSS represents strings and
> decides on valid characters that might be useful too.
>
>  	-thomas
>
> >> library("foreign")
> >> spssdaten <- read.spss("projets.por")
> >> attr(spssdaten$PROJETX, "value.labels")[1:20]
> >
> >              Bg Stammzellenforschung                                  Bb
> >                                  863                                  
> > 862 Bb Neugestaltung des Finanzausgleichs
> >                                  861                                  
> > 854 EV Postdienste f                                   Bb 853            
> >                       852 Bb                         Bg Steuerpaket 851  
> >                                 843 Bb Anhebung der Mehrwertsteuer s     
> >                 11. AHV-Revision 842                                  
> > 841 Volkinitiative Lebenslange Verwahrung
> >                                  833                                  
> > 832 Gegenentwurf zur Avanti             EV Lehrstellen-Initiative 831    
> >                               824 EV Moratorium Plus                   
> > EV Strom ohne Atom 823                                   822 EV Ja zu
> > fairen Mieten                   EV Gleiche Rechte f 821                  
> >                 815 EV Gesundheitsinitiative                EV
> > Sonntags-Initiative 814                                   813
> >
> > The SPSS-File is okay:
> >> system("cat projets.por |grep Postdienste")
> >
> > echtserwerb 3. GenerationSD/N/EV Postdienste für alleSE/16/Änderrung Bg 
> > EOG Mut
> >
> > How can I read the SPSS-File with the Umlaut?
> >
> > Bye
> > Thomas Kuster
> >
> > R: 2.1.0 (2005-04-18)
> > OS: Debian Linux, 2.6.10-isgee-neptun-1
> >
>
> Thomas Lumley			Assoc. Professor, Biostatistics
> tlumley at u.washington.edu	University of Washington, Seattle


