[R] SPSS data import: problems & work arounds for GSS surveys

Tue Mar 3 14:43:37 CET 2009

Dear Paul,

I encountered this problem the other day, and it went away when I updated
the foreign package from version 0.8-32 to 0.8-33.
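
In case it is useful, here is a minimal sketch of checking whether the installed copy of foreign is older than 0.8-33. The helper name needs_update is made up for illustration, and the install step is left as a message rather than run automatically.

```r
# Hypothetical helper: TRUE if `pkg` is missing or older than `minimum`.
needs_update <- function(pkg, minimum) {
  installed <- tryCatch(packageVersion(pkg), error = function(e) NULL)
  is.null(installed) || installed < package_version(minimum)
}

if (needs_update("foreign", "0.8-33")) {
  message("foreign is missing or older than 0.8-33; ",
          "consider install.packages(\"foreign\")")
}
```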

I hope this helps,
 John

------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Paul Johnson
> Sent: March-02-09 10:58 PM
> To: R-help
> Subject: [R] SPSS data import: problems & work arounds for GSS surveys
> 
> I'm using R 2.8.1 on Ubuntu 8.10.  I'm writing partly to ask what's
> wrong, partly to tell other users who search that there is a
> workaround.
> 
> The General Social Survey is a long standing series of surveys
> provided by NORC (National Opinion Research Center).  I have
> downloaded some years of the survey data in SPSS format (here's the
> site: http://www.norc.org/GSS+Website/Download/SPSS+Format/).  When I
> try to import using foreign, I get an error like so:
> 
> > library(foreign)
> > dat <- read.spss("gss2006.sav", to.data.frame=T, trim.factor.names=T)
> Error in inherits(x, "factor") : object "cp" not found
> In addition: Warning messages:
> 1: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
>   gss2006.sav: File contains duplicate label for value 99.9 for variable
> TVRELIG
> 2: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
>   gss2006.sav: File contains duplicate label for value 99.9 for variable SEI
> 3: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
>   gss2006.sav: File contains duplicate label for value 99.9 for
> variable FIRSTSEI
> 4: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
>   gss2006.sav: File contains duplicate label for value 99.9 for variable
> PASEI
> 5: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
>   gss2006.sav: File contains duplicate label for value 99.9 for variable
> MASEI
> 6: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
>   gss2006.sav: File contains duplicate label for value 99.9 for variable
> SPSEI
> 7: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
>   gss2006.sav: File contains duplicate label for value 0.75 for
> variable YEARSJOB
> 8: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
>   gss2006.sav: File-indicated character representation code (1252)
> looks like a Windows codepage
> 
> No dat object is created from this.
> 
> 
> I have found a workaround.  I installed PSPP version 0.6.0 and used
> it to open the sav file and then re-save it in SPSS sav format.
> That creates an SPSS file that foreign's read.spss function can open.
> 
> I still see the warnings about redundant value labels, but as far as I
> can see these are harmless.  A working object is obtained like so:
> 
> > dat <- read.spss("gss-pspp.sav")
> Warning messages:
> 1: In read.spss("gss-pspp.sav") :
>   gss-pspp.sav: File contains duplicate label for value 99.9 for
> variable TVRELIG
> 2: In read.spss("gss-pspp.sav") :
>   gss-pspp.sav: File contains duplicate label for value 0.75 for
> variable YEARSJOB
> 3: In read.spss("gss-pspp.sav") :
>   gss-pspp.sav: File contains duplicate label for value 99.9 for variable SEI
> 4: In read.spss("gss-pspp.sav") :
>   gss-pspp.sav: File contains duplicate label for value 99.9 for
> variable FIRSTSEI
> 5: In read.spss("gss-pspp.sav") :
>   gss-pspp.sav: File contains duplicate label for value 99.9 for variable
> PASEI
> 6: In read.spss("gss-pspp.sav") :
>   gss-pspp.sav: File contains duplicate label for value 99.9 for variable
> MASEI
> 7: In read.spss("gss-pspp.sav") :
>   gss-pspp.sav: File contains duplicate label for value 99.9 for variable
> SPSEI
> 
> 
> There is still some trouble with the importation of this SPSS file,
> however.  It has the symptoms of being a non-rectangular data array, I
> think.  What do you think about these warnings:
> 
> > dat <- read.spss("gss-pspp.sav",to.data.frame=T)
> There were 22 warnings (use warnings() to see them)
> > warnings()
> Warning messages:
> 1: In read.spss("gss-pspp.sav", to.data.frame = T) :
>   gss-pspp.sav: File contains duplicate label for value 99.9 for
> variable TVRELIG
> 2: In read.spss("gss-pspp.sav", to.data.frame = T) :
>   gss-pspp.sav: File contains duplicate label for value 0.75 for
> variable YEARSJOB
> 3: In read.spss("gss-pspp.sav", to.data.frame = T) :
>   gss-pspp.sav: File contains duplicate label for value 99.9 for variable SEI
> 4: In read.spss("gss-pspp.sav", to.data.frame = T) :
>   gss-pspp.sav: File contains duplicate label for value 99.9 for
> variable FIRSTSEI
> 5: In read.spss("gss-pspp.sav", to.data.frame = T) :
>   gss-pspp.sav: File contains duplicate label for value 99.9 for variable
> PASEI
> 6: In read.spss("gss-pspp.sav", to.data.frame = T) :
>   gss-pspp.sav: File contains duplicate label for value 99.9 for variable
> MASEI
> 7: In read.spss("gss-pspp.sav", to.data.frame = T) :
>   gss-pspp.sav: File contains duplicate label for value 99.9 for variable
> SPSEI
> 8: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 9: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 10: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 11: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 12: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 13: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 14: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 15: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 16: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 17: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 18: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 19: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 20: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 21: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 22: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
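
Warnings 8 to 22 look like ordinary vector recycling rather than a sign of non-rectangular data: the third term, xi[xi == z[3L]], is a subset of xi and so is usually shorter than xi. The sketch below reproduces the same message with made-up values for xi and z. The "presumably intended" form is a guess and is not taken from the foreign sources.

```r
# Made-up data: xi is a column of values, z holds three cut-off values.
xi <- c(1, 5, 9, 9, 3, 7, 2)
z  <- c(8, 2, 9)

# Same shape as the expression in the warnings. xi[xi == z[3L]] has
# length 2 here, and 7 is not a multiple of 2, so R warns while recycling.
got_warning <- tryCatch({
  xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]]
  FALSE
}, warning = function(w) TRUE)

# Presumably intended form (an assumption): compare element by element.
flagged <- xi >= z[1L] | xi <= z[2L] | xi == z[3L]

print(got_warning)
print(flagged)
```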
> 
> 
> While puzzling over this, I have tested the SPSS functions in the
> package memisc. This has some truly handy features!  Read ?importer
> and you'll see it can generate a list of variables as well as a
> codebook. It can also handle an SPSS portable file.
> Importer works a little bit like SPSS, actually, because the metadata
> is accessed, but the data is not really loaded until later (as far as
> I can tell, one must run either subset or as.data.set to force the
> actual data read). One can generate the description and codebook
> without accessing the data.
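
A rough sketch of that metadata-first workflow follows, assuming memisc is installed and that gss2006.sav sits in the working directory. The wrapper peek_spss is a made-up name, and the sketch does nothing if either assumption fails.

```r
# Hypothetical wrapper: look at SPSS metadata with memisc before a full read.
peek_spss <- function(path) {
  if (!requireNamespace("memisc", quietly = TRUE) || !file.exists(path)) {
    message("memisc or the data file is not available; nothing to do")
    return(invisible(NULL))
  }
  library(memisc)                   # attach so memisc's methods are found
  idat <- spss.system.file(path)    # importer object; data not loaded yet
  print(description(idat))          # variable names with their labels
  small <- subset(idat, select = c(gunlaw))  # pull in a single variable
  print(codebook(small))            # codebook for that variable only
  invisible(idat)
}

peek_spss("gss2006.sav")
```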
> 
> > idat <- spss.system.file("gss2006.sav")
> > show(idat)
> 
> SPSS system file 'gss2006.sav'
> 	with 5137 variables and 4510 observations
> 
> A subset function can access the particular variables from the data.
> 
> 
> > idat2 <- subset(idat,  select=c(gunlaw))
> > idat2
> 
> Data set with 4510 observations and 1 variables
> 
>    gunlaw
> 1  OPPOSE
> 2    *NAP
> 3    *NAP
> 4   FAVOR
> 5   FAVOR
> 6    *NAP
> 7   FAVOR
> 8    *NAP
> 9   FAVOR
> 10  FAVOR
> 11  FAVOR
> 12  FAVOR
> 13  FAVOR
> 14   *NAP
> 15   *NAP
> 16   *NAP
> 17  FAVOR
> 18   *NAP
> 19  FAVOR
> 20   *NAP
> 21   *NAP
> 22 OPPOSE
> 23   *NAP
> 24   *NAP
> 25   *NAP
> .. ......
> (25 of 4510 observations shown)
> 
> and the function "as.data.set" will force a full read of all the data
> columns:
> 
> 
> > idat3 <- as.data.set(idat)
> >
> 
> > table(idat3$gunlaw, idat2$gunlaw)
> 
>        0    1    2    8    9
>   0 2507    0    0    0    0
>   1    0 1568    0    0    0
>   2    0    0  395    0    0
>   8    0    0    0   35    0
>   9    0    0    0    0    5
> 
> 
> So, in conclusion, I've found troubles with read.spss in foreign, but
> have been able to work around that by accessing data with PSPP or the
> functions from the memisc package.  The only advantage of using the
> PSPP program (its GUI is psppire) is that you can see the data in a
> rectangular spreadsheet that is more-or-less searchable.  It has that
> same hard-to-use interface pioneered at SPSS (it hides variable names
> and displays descriptions in choosers). But the rectangular display in
> PSPP is nice.
> 
> pj
> 
> --
> Paul E. Johnson
> Professor, Political Science
> 1541 Lilac Lane, Room 504
> University of Kansas
> 