[R] spss imports--trouble with to.data.frame

Peter Ehlers ehlers at ucalgary.ca
Sat Nov 14 00:38:13 CET 2009


I can't really help you with your problem, but maybe
importing with use.value.labels=FALSE will at least
get rid of the 'duplicated levels' warnings.
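
A minimal sketch of that suggestion. Since the ESS file cannot be passed around, this uses the small electric.sav example file that ships with the foreign package as a stand-in; the object names (raw, dat_num) are made up for illustration.

```r
library(foreign)

# Example file shipped with 'foreign'; it stands in for "ESS3e03_2.por".
sav <- system.file("files", "electric.sav", package = "foreign")

# Read the raw numeric codes instead of turning labelled values into factors.
raw <- read.spss(sav, use.value.labels = FALSE)

# Convert to a data frame as a separate step.
dat_num <- as.data.frame(raw, stringsAsFactors = FALSE)

str(dat_num)
```

With no factors being built, there are no factor levels to be duplicated, so that particular warning should not come up. Whether it also cures the all-NA problem with to.data.frame I cannot say.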

  -Peter Ehlers

Paul Johnson wrote:
> My students are working with several SPSS datasets provided by the
> European Social Survey. If you register your name, you can download it
> too. This is the 2004 data, for example:
> 
> http://ess.nsd.uib.no/ess/round2/
> 
> I cannot give you the European Survey dataset, but you can download it
> for free if you like, and then you could run these commands to
> reproduce the weird pattern described below.
> 
> library(foreign)
> d2 <- read.spss("ESS3e03_2.por")
> warnings()
> 
> str(d2$HAPPY)
> d2 <- as.data.frame(d2)
> str(d2$HAPPY)
> 
> d2 <- read.spss("ESS3e03_2.por",to.data.frame=T)
> warnings()
> str(d2$HAPPY)
> 
> Here's my info for this example:
> 
>> sessionInfo()
> R version 2.10.0 (2009-10-26)
> x86_64-pc-linux-gnu
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] foreign_0.8-38
> 
> 
> The weirdness that follows is the difference between
> 
> d2 <- read.spss( ... , to.data.frame=T)
> 
> and
> 
> d2 <- read.spss ()
> d2 <- as.data.frame(d2)
> 
> The former causes all data to become <NA> but the latter seems mostly OK.
> 
> 
>> library(foreign)
>> d2 <- read.spss("ESS3e03_2.por")
> There were 12 warnings (use warnings() to see them)
>> warnings()
> Warning messages:
> 1: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ",  ... :
>   duplicated levels will not be allowed in factors anymore
> 2: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ",  ... :
>   duplicated levels will not be allowed in factors anymore
> 3: In `levels<-`(`*tmp*`, value = c("Refusal", "Don't know",  ... :
>   duplicated levels will not be allowed in factors anymore
> 4: In `levels<-`(`*tmp*`, value = c("No second language mentioned",  ... :
>   duplicated levels will not be allowed in factors anymore
> 5: In `levels<-`(`*tmp*`, value = c("Sans dipl", "Non dipl",  ... :
>   duplicated levels will not be allowed in factors anymore
> 6: In `levels<-`(`*tmp*`, value = c("\"Ej avslutad
> folkskola/grundskola\"",  ... :
>   duplicated levels will not be allowed in factors anymore
> 7: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
> senior officials and managers",  ... :
>   duplicated levels will not be allowed in factors anymore
> 8: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
> senior officials and managers",  ... :
>   duplicated levels will not be allowed in factors anymore
> 9: In `levels<-`(`*tmp*`, value = c("K", "K", "Frederiksborg Amt",  ... :
>   duplicated levels will not be allowed in factors anymore
> 10: In `levels<-`(`*tmp*`, value = c("P", "L", "Kesk-Eesti",  ... :
>   duplicated levels will not be allowed in factors anymore
> 11: In `levels<-`(`*tmp*`, value = c("Galicia", "Principado de Asturias",  ... :
>   duplicated levels will not be allowed in factors anymore
> 12: In `levels<-`(`*tmp*`, value = c("Stockholm", "", "Sydsverige",  ... :
>   duplicated levels will not be allowed in factors anymore
> 
>> str(d2$HAPPY)
>  Factor w/ 14 levels "Extremely unhappy",..: 9 7 9 11 9 6 9 4 13 8 ...
> 
>> d2 <- as.data.frame(d2)
>> str(d2$HAPPY)
>  Factor w/ 14 levels "Extremely unhappy",..: 9 7 9 11 9 6 9 4 13 8 ...
> 
> That appears valid.  On my first effort, I had tried to get the data
> frame in a single shot with read.spss
> 
>> d2 <- read.spss("ESS3e03_2.por",to.data.frame=T)
> There were 15 warnings (use warnings() to see them)
>> warnings()
> Warning messages:
> 1: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 2: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 3: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
>   longer object length is not a multiple of shorter object length
> 4: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ",  ... :
>   duplicated levels will not be allowed in factors anymore
> 5: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ",  ... :
>   duplicated levels will not be allowed in factors anymore
> 6: In `levels<-`(`*tmp*`, value = c("Refusal", "Don't know",  ... :
>   duplicated levels will not be allowed in factors anymore
> 7: In `levels<-`(`*tmp*`, value = c("No second language mentioned",  ... :
>   duplicated levels will not be allowed in factors anymore
> 8: In `levels<-`(`*tmp*`, value = c("Sans dipl", "Non dipl",  ... :
>   duplicated levels will not be allowed in factors anymore
> 9: In `levels<-`(`*tmp*`, value = c("\"Ej avslutad
> folkskola/grundskola\"",  ... :
>   duplicated levels will not be allowed in factors anymore
> 10: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
> senior officials and managers",  ... :
>   duplicated levels will not be allowed in factors anymore
> 11: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
> senior officials and managers",  ... :
>   duplicated levels will not be allowed in factors anymore
> 12: In `levels<-`(`*tmp*`, value = c("K", "K", "Frederiksborg Amt",  ... :
>   duplicated levels will not be allowed in factors anymore
> 13: In `levels<-`(`*tmp*`, value = c("P", "L", "Kesk-Eesti",  ... :
>   duplicated levels will not be allowed in factors anymore
> 14: In `levels<-`(`*tmp*`, value = c("Galicia", "Principado de Asturias",  ... :
>   duplicated levels will not be allowed in factors anymore
> 15: In `levels<-`(`*tmp*`, value = c("Stockholm", "", "Sydsverige",  ... :
>   duplicated levels will not be allowed in factors anymore
> 
>> str(d2$HAPPY)
>  Factor w/ 13 levels "Extremely unhappy",..: NA NA NA NA NA NA NA NA NA NA ...
> 
> Oh, heck, all the values are missing!! Somehow, putting
> "to.data.frame" inside the read.spss causes a different outcome than
> using as.data.frame after reading in the data.
> 
> The symptoms of this in R-2.9 are a little different, but the
> conclusion is the same.  Help?
> 
> In case you are a student who wants to work with this data, I can
> share with you the large script that I have been accumulating so that
> you might "play along".  It turns out to be surprisingly difficult to
> "recode" these factor variables that have levels like "none", "1",
> "2",..."9", "total".
> 
> 
> 
> ## Paul Johnson
> ## November 13, 2009
> 
> ## A question arose in the lab. A student asks "I want
> ## to compare the answers from two different editions
> ## of the European Social Survey."
> 
> ## I will add this to Stuff Worth Knowing later, but
> ## I can share this tutorial with you right now.
> 
> ## From this website:
> 
> ## http://ess.nsd.uib.no/ess
> 
> ## Download those European Social Survey Datasets into a directory.
> 
> ## In a terminal, use the unzip command:
> ## unzip ESS3e03_2.spss.zip
> 
> ## unzip ESS2e03_1.spss.zip
> 
> ## Then run the following in R.
> 
> 
> library(foreign)
> 
> d2 <- read.spss("ESS3e03_2.por",to.data.frame=T)
> 
> 
> d2 <- read.spss("ESS3e03_2.por")
> warnings()
> 
> ### You can try to go into a data frame in one
> ### step, that's an option in read.spss. But
> ### we saw warnings, and wanted to be careful.
> 
> d2 <- as.data.frame(d2)
> d2$whichSurvey <- 2
> 
> d3 <- read.spss("ESS2e03_1.por")
> 
> d3 <- as.data.frame(d3)
> d3$whichSurvey <- 3
> 
> namesd2 <- names(d2)
> namesd3 <- names(d3)
> 
> commonNames <- intersect( namesd3, namesd2)
> 
> combod23 <- rbind(d2[ , commonNames], d3[, commonNames])
> 
> save(combod23, file="combod23.Rda")
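
The merge step above, sketched on two toy data frames rather than the ESS files. Every name here (a, b, onlyA, onlyB, cls_a, cls_b, mismatch) is invented for the illustration. The class comparison at the end is only a guess at where to look when rbind complains about invalid factor levels, not a confirmed diagnosis.

```r
# Two toy data frames standing in for the two survey rounds.
a <- data.frame(id = 1:3, HAPPY = c(7, 8, 10), onlyA = c("x", "y", "z"))
b <- data.frame(id = 4:5, HAPPY = c(5, 9), onlyB = c(TRUE, FALSE))

# Tag each row with its source before stacking.
a$whichSurvey <- "a"
b$whichSurvey <- "b"

# Keep only the columns present in both, then stack the rows.
common <- intersect(names(a), names(b))
combo <- rbind(a[, common], b[, common])

# List the common columns whose class differs between the two sources.
cls_a <- vapply(a[common], function(x) class(x)[1], character(1))
cls_b <- vapply(b[common], function(x) class(x)[1], character(1))
mismatch <- common[cls_a != cls_b]

combo
mismatch
```
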
> 
> 
> ## Error
> ##Warning messages:
> ##1: In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA,  :
> ##  invalid factor level, NAs generated
> ##2: In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA,  :
> ##  invalid factor level, NAs generated
> ##3: In `[<-.factor`(`*tmp*`, ri, value = c(1, 1, 1, 1, 1, 1, 1, 1, 1,  :
> ##  invalid factor level, NAs generated
> 
> ## That worries me a little bit. The warnings did too.
> 
> ## Inspect a few lines in the result.
> 
> combod23[1:4, ]
> 
> ## fix doesn't work for me, did not bother to investigate.
> 
> ##> fix(combod23)
> ##Error in edit.data.frame(get(subx, envir = parent), title = subx, ...) :
> ##  can only handle vector and factor elements
> ## That means some data from hell came into this thing.
> 
> ## I suspect that combod23 is OK.
> 
> ## The memory use on this exercise is huge! Try to help it
> 
> rm (d2)
> rm (d3)
> 
> 
> ## But I worry. I have 2 ways that I use to try to figure this
> ## out. One is to open the dataset in a clone of SPSS called
> ## "PSPP". Actually, the executable is "psppire".
> ##
> ## The other thing I do is open the same data again in
> ## a numeric format, and compare the 2 combined data frames
> 
> ## This is also a useful exercise because it helps you
> ## understand what a "factor" is in R.
> 
> dn2 <- read.spss("ESS3e03_2.por", use.value.labels = F)
> 
> 
> dn2 <- as.data.frame(dn2)
> dn2$whichSurvey <- 2
> 
> dn3 <- read.spss("ESS2e03_1.por", use.value.labels = F)
> 
> dn3 <- as.data.frame(dn3)
> dn3$whichSurvey <- 3
> 
> ## Might be smart to compare
> # dn2$HAPPY[1:50]
> # d2$HAPPY[1:50]
> 
> namesdn2 <- names(dn2)
> namesdn3 <- names(dn3)
> 
> commonNNames <- intersect( namesdn3, namesdn2 )
> 
> combodn23 <- rbind(dn2[ , commonNNames], dn3[, commonNNames])
> 
> save(combodn23, file="combodn23.Rda")
> 
> table( combod23$HAPPY, combodn23$HAPPY)
> 
> ## In summary, whenever I want to use a variable from
> ## the combined data frame, I would probably compare
> ## against combodn23 just to feel safe.
> 
> 
> 
> 
> ## Note, when you come back to work on this project again, you
> ## might as well just reload the saved copies of combod23 and
> ## combodn23.
> 
> ## load("combod23.Rda")
> 
> ## load("combodn23.Rda")
> 
> ## That will put you at the current spot, no need to redo the merge
> 
> 
> ## Now, about "recoding". If you just want numerical
> ## data, you might consider using combodn23.
> 
> ## But if you want some factors and some numerical
> ## variables, then you might need to recode to reclaim
> ## values.
> 
> ## HAPPY turns out to be an interesting example of a
> ## PAIN IN THE ASS because in SPSS, it is scored from
> ## 0 to 10, but they give value labels only for scores
> ## 0=  Extremely unhappy
> ## and
> ## 10= Extremely happy
> ##
> ## And the SPSS column has no labels for values 1-9.
> ## If SPSS gave NO labels at all, then this would come
> ## into R as a numeric variable. BUT, because there are
> ## 2 levels named, then R makes a factor out of it.
> 
> ## When R turns it into a factor, you
> ## end up with a nutty looking factor, which has
> ## levels you don't really appreciate.
> 
> levels(combod23$HAPPY)
> # [1] "Extremely unhappy" "1"                 "2"
> # [4] "3"                 "4"                 "5"
> # [7] "6"                 "7"                 "8"
> #[10] "9"                 "Extremely happy"   "Refusal"
> #[13] "Don't know"        "No answer"
> 
> 
> 
> ## Create a new variable to play with
> combod23$HAPPY2 <- combod23$HAPPY
> 
> ## Change Extremely Unhappy to text "0"
> levels(combod23$HAPPY)[1] <- "0"
> ## Change Extremely Happy to "10"
> levels(combod23$HAPPY)[11] <- "10"
> 
> HELL <- levels(combod23$HAPPY)
> 
> ### Look at HELL
> 
> HELL
> 
> combod23$HAPPY2[combod23$HAPPY %in% HELL[12:14] ] <- NA
> 
> ##CHECK RESULT
> table(combod23$HAPPY, combod23$HAPPY2)
> 
> 
> ## Eliminate the unused levels from HAPPY2
> combod23$HAPPY2 <- factor(combod23$HAPPY2)
> ### Same is found with
> ## combod23$HAPPY2 <- combod23$HAPPY2[drop = TRUE]
> 
> ## Use the "factor trick" to
> ## reset the variable back to numeric:
> 
> combod23$HAPPYN <- as.numeric(HELL)[as.integer(combod23$HAPPY2)]
> 
> ##CHECK RESULT
> table(combod23$HAPPY2, combod23$HAPPYN)
> 
> ## CHECK by comparing against numeric data from spss
>  table(combodn23$HAPPY, combod23$HAPPYN)
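
The "factor trick" above can be tried on a small self-contained factor. This is a toy sketch, not the ESS variable: the level names copy the ones listed for HAPPY, but the five data values and the object names (lev, happy, score, happy_num) are invented.

```r
# Toy factor mimicking HAPPY: labelled endpoints, bare digits in between,
# plus the missing-value categories.
lev <- c("Extremely unhappy", as.character(1:9), "Extremely happy",
         "Refusal", "Don't know", "No answer")
happy <- factor(c("Extremely unhappy", "3", "Extremely happy", "Refusal", "9"),
                levels = lev)

# Rename the two labelled endpoints to their numeric scores.
levels(happy)[levels(happy) == "Extremely unhappy"] <- "0"
levels(happy)[levels(happy) == "Extremely happy"] <- "10"

# One number per level; levels that are not numbers become NA.
# suppressWarnings() hides the expected coercion warning.
score <- suppressWarnings(as.numeric(levels(happy)))

# Index the per-level scores by the factor's integer codes.
happy_num <- score[as.integer(happy)]
happy_num
```

Matching the levels by name rather than by position (1 and 11) is a deliberate choice here, so the sketch does not depend on the order in which the levels happen to come in.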
> 
> 
> 
> 
> ## Next, a student asks "how can I make that same recode
> ## on a lot of variables?" I'm going to have to leave
> ## that one unanswered.  I think the answer will probably
> ## be to get a list of variables, then use "lapply" to
> ## do the same thing to each variable in turn.  But
> ## I have not written up a simple, understandable example
> ## yet
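
One possible shape for the lapply idea, offered as a rough sketch on toy data. The helper recode_scale and the variable HAPPYB are hypothetical, and the sketch assumes every variable being recoded uses the same two endpoint labels, which would need checking against the real data.

```r
# Hypothetical helper: turn a 0-10 factor with labelled endpoints into numbers.
# 'low' and 'high' are the labels standing for 0 and 10; any other
# non-numeric level (e.g. "Refusal") becomes NA.
recode_scale <- function(f, low, high) {
  lev <- levels(f)
  lev[lev == low] <- "0"
  lev[lev == high] <- "10"
  suppressWarnings(as.numeric(lev))[as.integer(f)]
}

# Toy data frame with two such variables (values and names are made up).
toy <- data.frame(
  HAPPY  = factor(c("Extremely unhappy", "4", "Extremely happy", "Refusal")),
  HAPPYB = factor(c("9", "Extremely happy", "Don't know", "Extremely unhappy"))
)

# Apply the same recode to each listed variable and store the results
# in new columns with an "N" suffix.
vars <- c("HAPPY", "HAPPYB")
toy[paste0(vars, "N")] <- lapply(toy[vars], recode_scale,
                                 low = "Extremely unhappy",
                                 high = "Extremely happy")
toy
```
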
> 
> 
> 
> ## After the data is all recoded and homogenized, then we
> ## could run any analysis we want, and throw in the variable
> ## "whichSurvey" to see if there is a difference between the
> ## two models.
> 
> ## Example, choose your y and x1 and x2, then
> 
> ## mod <- lm(y~ (x1+x2)*whichSurvey, data=combod23)
> 
> ## or if you think the difference is just in the intercept:
> 
> ## mod <- lm(y~ x1+x2 + whichSurvey, data=combod23)
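
The two model formulas above, run on simulated data so the sketch is self-contained. Everything here is made up (sim, the coefficients, the sample size). whichSurvey is stored as a factor, which is an assumption of this sketch; in the script above it is the number 2 or 3.

```r
set.seed(1)
n <- 200
sim <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  whichSurvey = factor(rep(c("2", "3"), each = n / 2))
)

# Simulated outcome: the slope on x1 differs between the two surveys.
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 +
  0.8 * sim$x1 * (sim$whichSurvey == "3") + rnorm(n, sd = 0.5)

# Intercept and both slopes may differ by survey.
mod_int <- lm(y ~ (x1 + x2) * whichSurvey, data = sim)

# Only the intercept may differ by survey.
mod_add <- lm(y ~ x1 + x2 + whichSurvey, data = sim)

# F test comparing the two nested models.
anova(mod_add, mod_int)
```
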
>



