[Rd] read.spss issues

Jeroen Ooms jeroen.ooms at stat.ucla.edu
Wed Feb 15 07:05:29 CET 2012


Someone supplied me with a small SPSS data file that caused a buffer
overflow and then a crash when reading it into R. It seems like a pretty
serious issue to me. Unfortunately I can't supply the dataset at hand
and I have a hard time reproducing it with a toy example. But I found
at least two issues that might be related.

The first one is that when the SPSS dataset has a 'string' variable
that is longer than 200 characters, read.spss generates a bunch of
warnings and creates additional variables in the dataset. E.g.:

library(foreign)
x <- read.spss("http://www.stat.ucla.edu/~jeroen/spss/longstring.sav")
str(x)

The second problem is that the SPSS data format allows 'duplicate
labels' to be specified, whereas duplicate levels are not allowed for
factors. read.spss does not deal with this and creates a bad factor

x <- read.spss("http://www.stat.ucla.edu/~jeroen/spss/duplicate_labels.sav",
               use.value.labels = TRUE)
levels(x$opinion)

which causes issues downstream. I am not sure whether this is an issue
in read.spss() or as.factor(), but it might be wise to detect duplicate
labels and assign them all one and the same integer code when
converting to a factor.
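
To make the suggestion concrete, here is a minimal sketch of the merging
idea. It is not what read.spss() does internally; the helper name
merge_duplicate_labels and the example codes and labels are made up for
illustration, and it assumes the raw codes and their value labels are
available as two parallel vectors.

```r
# Hypothetical helper (not part of the foreign package): build a factor
# from raw codes and SPSS-style value labels, mapping every code that
# shares a label onto one and the same level.
merge_duplicate_labels <- function(values, codes, labels) {
  stopifnot(length(codes) == length(labels))
  # The unique labels, in order of first appearance, become the levels.
  lev <- unique(labels)
  # Translate each value to its label; unlabelled or missing values give NA.
  lab <- labels[match(values, codes)]
  # The levels are unique by construction, so the factor is well formed.
  factor(lab, levels = lev)
}

# Made-up example: codes 1 and 2 both carry the label "agree".
values <- c(1, 2, 3, 2, 1, NA)
codes  <- c(1, 2, 3)
labels <- c("agree", "agree", "disagree")

f <- merge_duplicate_labels(values, codes, labels)
levels(f)
as.integer(f)
```

Values coded 1 and 2 end up with the same integer code, so
anyDuplicated(levels(f)) is 0 and downstream code sees an ordinary factor.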

Thank you,

Jeroen


