[R] missing values imputation

Wed May 12 18:44:50 CEST 2004

On 12-May-04 Anne wrote:
> What R functionalities are there to do missing values imputation
> (substantial proportion of missing data)?
> I would prefer to use maximum likelihood methods; is the EM algorithm
> implemented? In which package?

Hi Anne,
R already has packages/libraries called "cat", "norm" and "mix" which,
while they are not part of the standard installation, can be readily
downloaded and installed from any CRAN website -- see under "contributed
sources".

These implement in R Schafer's S code for what he calls "CAT", "NORM"
and "MIX". These are for imputing missing data where the data are
respectively entirely categorical, entirely continuous ("norm" operates
on the basis that the data are a sample from a multivariate normal
distribution) and a mixture of both (some variables categorical, some
continuous). All include routines for multiple imputation, and for
extracting appropriate information about the parameters from the
imputations.

Schafer also has an S function "PAN" which imputes missing values
in "panel" data. I don't think this has been implemented for R yet.

There is one type of data which also, I think, has nothing implemented
for R (and I have not heard of a specially written routine for S-plus
either). This is so-called "semi-continuous" data -- where the value
of a variable may either be "continuous" or else take a specific
value (typically zero). E.g. "How much did you spend on alcohol last
week?" -- answer may be a positive amount, maybe log-normally distributed,
or else zero. You can approach data of this kind with missing values
by combining "cat" and "norm", but it's tricky and may not correspond
to a valid model.

All of Schafer's methods use maximum-likelihood estimation of the
parameters for the first phase of the imputation, using the EM algorithm
(and I'll respond to Rolf Turner's comments shortly).
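
To make the EM step concrete, here is a minimal sketch in base R of
maximum-likelihood estimation of the mean and covariance matrix of
multivariate normal data with missing values. It is not Schafer's code
and not what "norm" does internally; the function name em_mvnorm, its
arguments and the simulated data are all made up for illustration, and
it assumes the values are missing at random.

```r
# EM for the mean and covariance of multivariate normal data with NAs.
# Hypothetical helper for illustration only (not from any package).
em_mvnorm <- function(x, maxit = 500, tol = 1e-8) {
  n <- nrow(x); p <- ncol(x)
  mu <- colMeans(x, na.rm = TRUE)                  # starting values
  sigma <- diag(apply(x, 2, var, na.rm = TRUE), p)
  for (it in seq_len(maxit)) {
    sum_x <- numeric(p)
    sum_xx <- matrix(0, p, p)
    for (i in seq_len(n)) {
      m <- is.na(x[i, ])
      xi <- x[i, ]
      ci <- matrix(0, p, p)     # conditional covariance of the missing part
      if (all(m)) {
        xi <- mu; ci <- sigma
      } else if (any(m)) {
        o <- !m
        # E-step: regress the missing components on the observed ones
        b <- sigma[m, o, drop = FALSE] %*% solve(sigma[o, o, drop = FALSE])
        xi[m] <- mu[m] + b %*% (x[i, o] - mu[o])
        ci[m, m] <- sigma[m, m, drop = FALSE] - b %*% sigma[o, m, drop = FALSE]
      }
      sum_x <- sum_x + xi
      sum_xx <- sum_xx + tcrossprod(xi) + ci
    }
    # M-step: update parameters from the expected sufficient statistics
    mu_new <- sum_x / n
    sigma_new <- sum_xx / n - tcrossprod(mu_new)
    delta <- max(abs(mu_new - mu), abs(sigma_new - sigma))
    mu <- mu_new; sigma <- sigma_new
    if (delta < tol) break
  }
  list(mu = mu, sigma = sigma, iterations = it)
}

# Simulated example: two correlated variables, a third of column b missing
set.seed(42)
n <- 300
z <- matrix(rnorm(n * 2), n, 2)
x <- cbind(a = z[, 1], b = 0.8 * z[, 1] + 0.6 * z[, 2])
x[sample(n, 100), "b"] <- NA
fit <- em_mvnorm(x)
fit$mu
fit$sigma
```

With complete data this reduces to the sample mean and the covariance
matrix with divisor n, which is a handy check that the sketch is doing
what it should.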

After that, you can make a simple imputation by sampling the missing
values from the distribution thus estimated. A more general, and indeed
sounder, approach is first to sample parameter values from their
posterior distribution, then to sample imputed values from the
distribution those parameters define, and to repeat both steps so as to
build up an array of datasets with the missing data filled in by
multiple imputation.
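
As a rough illustration of that two-stage sampling, here is a sketch in
base R for the simplest case I can think of: one incomplete variable y,
one complete variable x, and a normal linear regression of y on x with
the usual noninformative prior. All names (impute_once, imputations,
pooled) and the simulated data are made up for the example; this is not
the data augmentation routine in "norm", which handles general
multivariate patterns of missingness. The last few lines combine the
slope estimates across imputations using Rubin's rules.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[sample(n, 60)] <- NA                 # y is missing for 60 cases

obs <- !is.na(y)
X <- cbind(1, x)                       # design matrix: intercept and x
fit <- lm(y ~ x, subset = obs)         # fit on the complete cases
beta_hat <- coef(fit)
df <- df.residual(fit)
s2 <- sum(resid(fit)^2) / df
V <- solve(crossprod(X[obs, ]))

impute_once <- function() {
  # 1. draw parameters from their posterior distribution
  sigma2 <- df * s2 / rchisq(1, df)
  beta <- beta_hat + drop(t(chol(sigma2 * V)) %*% rnorm(length(beta_hat)))
  # 2. draw the missing y's from the distribution those parameters define
  y_imp <- y
  y_imp[!obs] <- drop(X[!obs, ] %*% beta) +
    rnorm(sum(!obs), sd = sqrt(sigma2))
  data.frame(x = x, y = y_imp)
}

# Repeat both steps to build up several completed datasets
imputations <- replicate(5, impute_once(), simplify = FALSE)

# Analyse each completed dataset, then pool the slope (Rubin's rules)
fits <- lapply(imputations,
               function(d) summary(lm(y ~ x, data = d))$coefficients)
est <- sapply(fits, function(cf) cf["x", "Estimate"])
se  <- sapply(fits, function(cf) cf["x", "Std. Error"])
m <- length(est)
within  <- mean(se^2)                  # average within-imputation variance
between <- var(est)                    # between-imputation variance
pooled <- c(estimate = mean(est),
            std.err = sqrt(within + (1 + 1 / m) * between))
pooled
```

The point of drawing sigma2 and beta afresh for each imputation is that
the spread between the completed datasets then reflects uncertainty
about the parameters as well as about the missing values themselves.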

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 12-May-04                                       Time: 17:44:50
------------------------------ XFMail ------------------------------