[R] Still missing something on missing values...
Peter Dalgaard BSA
p.dalgaard at biostat.ku.dk
Sun Oct 27 01:11:45 CEST 2002
Matej Cepl <matej at ceplovi.cz> writes:
> I have a SPSS datafile which is used for my textbook in the
> statistics (and which is available on
> http://abacon.com/fox/s6720p2.sav, but it is originally from
> When I opened it with SPSS 10 and ran Frequencies on it, I
> got 979 valid cases and 27 missing. However, see below
> (unfortunately, I have used R in preparation of my homework,
> which caused me an error on this):
> > data=read.spss("s6720p2.sav")
> > levels(data$CP1)
>  "Rf" "Dk" "Neither" "Oppose" "Favor"
> > length(data$CP1[data$CP1=="Favor"])
>  727
> > length(data$CP1[data$CP1=="Oppose"])
>  177
> > length(data$CP1[data$CP1=="Neither"])
>  79
> > length(data$CP1[data$CP1=="Dk"])
>  19
> > length(data$CP1[data$CP1=="Rf"])
>  3
> > data$CP1[data$CP1=="Rf" | data$CP1=="Dk"]<-NA
> > length(data$CP1[!is.na(data$CP1)])
>  983
> > length(data$CP1[is.na(data$CP1)])
>  22
> > 727+177+79
>  983
> Now, what is even stranger is that when I exported just
> the variable CP1 from the full file (in SPSS) and ran the
> same frequencies on it as in the full-size version, the results were
> the same as in R (yes, I have checked that the definition of the
> missing values was the same: 8, 9, labelled as Rf and Dk).
> I have uploaded the data and all reports (in PDF) on
> Could anybody help me to understand what I did wrong, please?
The length(data$CP1[data$CP1=="Rf"]) construction is unsound: if the
indexing variable contains NAs, each NA comparison selects an NA element,
and length() counts those as well. You'd be better off with
sum(data$CP1 %in% "Rf") or simply table(data$CP1), but that seems
unrelated here.
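To illustrate why the length(x[x == ...]) idiom is fragile, here is a minimal sketch on a toy factor. The vector x and its values are made up for illustration and merely stand in for data$CP1; they are not taken from the SPSS file.

```r
# Hypothetical toy factor with two missing values (not the real CP1 data).
x <- factor(c("Favor", "Oppose", NA, "Rf", "Favor", NA, "Dk"))

# A logical index containing NA selects NA elements, and length() counts them.
n_bad <- length(x[x == "Favor"])            # 2 real matches plus 2 NAs: 4

# %in% never returns NA, so summing it counts only the true matches.
n_sum <- sum(x %in% "Favor")                # 2

# which() drops NA from the index, which also gives the true count.
n_which <- length(x[which(x == "Favor")])   # 2

# table() tabulates every level at once; useNA = "ifany" shows the NAs too.
tab <- table(x, useNA = "ifany")
print(tab)
```

With no NAs in the vector all three counts agree, which is presumably why the idiom does no visible harm in the session quoted above.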
As you say, your cp1.pdf is perfectly in accordance with the R output,
whereas cp1-whole_data.pdf differs. It also includes the rather
extraordinary claim that 979+27=1005 !! Is there any chance you may
have accidentally modified it?
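The arithmetic can be checked directly from the figures quoted in this thread. The sketch below uses only the counts from the R session and the valid/missing numbers reported for the full-file SPSS run; the variable names are invented for illustration.

```r
# Counts as quoted from the R session in the original message.
r_counts <- c(Favor = 727, Oppose = 177, Neither = 79, Dk = 19, Rf = 3)

n_total   <- sum(r_counts)                                    # 1005
r_valid   <- sum(r_counts[c("Favor", "Oppose", "Neither")])   # 983
r_missing <- sum(r_counts[c("Dk", "Rf")])                     # 22

# Figures reported from the full-file SPSS Frequencies output.
spss_valid   <- 979
spss_missing <- 27
spss_sum     <- spss_valid + spss_missing                     # 1006

# The R split is internally consistent; the SPSS split is off by one.
print(c(r_total = r_valid + r_missing, spss_total = spss_sum))
```

So the R figures add up to 1005 cases, while the valid and missing counts reported for the full file add up to 1006.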
[If your instructor still insists that SPSS must be right, and this
really is what it gives as output, I'd point out the obvious
discrepancies with itself and with the data set with just the CP1
variable in it, leaving R out of the discussion...]
What is ICPSR, btw?
O__ ---- Peter Dalgaard Blegdamsvej 3
c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907