[R] Multiple imputation with plausible values already in the data

Tue Jul 17 14:58:48 CEST 2007

Hello,

this is not really an R-related question, but since the posting guide does not
forbid asking non-R questions (and even encourages it to some degree), I thought
I'd give it a try.

I am currently doing some secondary analyses of the PISA (http://pisa.oecd.org)
student data. I would like to treat missing values properly, that is, by using
multiple imputation (with the mix package). But I am not sure how to do the
imputation, since the data set provided by the OECD already contains variables
with plausible values.

Roughly, the situation is like this: for each of the cognitive (achievement)
scales, there are five variables holding plausible values. So for example, there
is not one variable for math achievement, but five, pv1math through pv5math.
There are, of course, no missing values on these variables.

Most other variables show some degree of missing data. For example, some
students did not report their parents' occupation, so there is no information
about the socio-economic background (HISEI). This is the kind of data I want to
impute.

My first thought was splitting the data into five datasets, each holding only
one of the plausible value variables, but all of the "normal" variables. So e.g.
the first data set would include pv1math, pv1read, HISEI, and gender; while the
second would include pv2math, pv2read, HISEI, and gender. I would run mix on the
five data sets independently and end up with five imputed data sets with no
missing values.
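
To make the split concrete, here is a minimal sketch in base R. The toy data
frame `pisa`, its made-up values, and the helper `split_by_pv` are hypothetical
and only illustrate the idea; the real PISA file has many more variables, and
the imputation step with mix is deliberately left out.

```r
# Hypothetical toy data standing in for the PISA student file.
# Column names follow the description above; all values are made up.
set.seed(1)
n <- 8
pv_names <- c(paste0("pv", 1:5, "math"), paste0("pv", 1:5, "read"))
pvs <- as.data.frame(matrix(rnorm(n * 10, mean = 500, sd = 100), nrow = n))
names(pvs) <- pv_names
pisa <- cbind(pvs,
              HISEI  = c(45, NA, 60, 33, NA, 71, 52, 48),
              gender = c(1, 2, 2, 1, 1, 2, NA, 1))

# Build the k-th data set: the k-th plausible value of each domain
# plus all "normal" variables (assumed here to be HISEI and gender).
split_by_pv <- function(data, k, domains = c("math", "read"),
                        others = c("HISEI", "gender")) {
  data[, c(paste0("pv", k, domains), others)]
}

# A list of five data sets, one per plausible value.
pv_sets <- lapply(1:5, split_by_pv, data = pisa)

names(pv_sets[[2]])   # pv2math, pv2read, HISEI, gender
```

Each element of `pv_sets` still contains the missing values on HISEI and
gender; these are what the five separate mix runs would have to fill in.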

But is this a valid approach? There would actually be two imputation runs per
data set: one for the plausible values on the achievement scales (done by the
OECD under an unknown model), and one for the other variables (done by me with
mix). The second run would use data from the first. Would this not lead to an
overestimation of the imputation variance? What alternative approaches are there?

Thank you in advance for your answers,

Uli