[R] missing data imputation

Sat Jul 9 16:03:35 CEST 2005

On 08-Jul-05 Anders Schwartz Corr wrote:
> 
> Dear R-help,
> 
> I am trying to impute missing data for the first time using R.
> The norm package seems to work for me, but the missing values
> that it returns seem odd at times -- for example it returns
> negative values for a variable that should only be positive.
> Does this matter in data analysis, and/or is there a way to
> limit the imputed values to be within the minimum and
> maximum of the actual data? Below is the code I am using.

If you have a variable that should only be positive, then strictly
speaking you should not treat it as normally distributed, since
a normal distribution -- however large the mean, however small
the variance -- theoretically has positive probability of giving
negative values. So what you have observed in your data is within
the job-description of the normal distribution.

In practice, whether this matters in data analysis depends on
the range of values in a typical dataset, on the mean and SD
of a typical fitted normal distribution, on the probability
that such a distribution will give a negative value, and on
the sample size. (Even if P(<0) is only 10^(-4), if you are
dealing with sample sizes of 10^6 you are very likely to get
some negative values).
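
To put numbers on that, here is a minimal sketch (in Python, standard
library only, purely for illustration; in R the same quantities come
from pnorm()). The function names and the example mean and SD are my
own assumptions, not anything taken from your data.

```python
from statistics import NormalDist

def prob_negative(mean, sd):
    """P(X < 0) for X ~ Normal(mean, sd)."""
    return NormalDist(mean, sd).cdf(0.0)

def expected_negatives(p_neg, n):
    """Expected number of negative values among n independent draws."""
    return p_neg * n

def prob_any_negative(p_neg, n):
    """Chance that at least one of n independent draws is negative."""
    return 1.0 - (1.0 - p_neg) ** n

# The case mentioned above: P(<0) = 10^(-4), sample size 10^6.
p, n = 1e-4, 10**6
print(expected_negatives(p, n))   # about 100 negative values expected
print(prob_any_negative(p, n))    # practically 1

# Hypothetical fitted distribution: mean 10, SD 3.
print(prob_negative(10.0, 3.0))
```

With a small dataset and a mean several SDs above zero, the same
calculation will tell you that negative values are rare enough to
ignore.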

Whether it matters in practice also depends, of course, on
whether it matters in practice. What, in the real world, will
break if there's a negative value or two in there?

In many cases people simply treat negative estimates of variables
which are intrinsically non-negative very crudely: if it comes
out negative, replace it with zero. This too is often a quick
fix where the fact that it is a lie simply has no practical
importance. But, of course, it may matter! That depends ...
(see above).
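
A small sketch of that quick fix, and of why it is a (usually
harmless) lie: clamping at zero pushes the mean of the imputed
values upwards. The values below are made up for illustration, and
clamp_at_zero is a hypothetical helper; in R one would typically
use pmax(x, 0).

```python
from statistics import mean

def clamp_at_zero(values):
    """Replace every negative value with zero; leave the rest alone."""
    return [max(0.0, v) for v in values]

# Made-up imputed values, two of which came out negative.
imputed = [-0.3, 0.5, 1.2, -0.1, 2.0]
clamped = clamp_at_zero(imputed)

print(clamped)
# The clamped mean is larger than the unclamped one.
print(mean(imputed), mean(clamped))
```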

It is also the case that imputed values generated by a procedure
such as NORM have greater dispersion than the variable itself.
This is a consequence of the way such imputation works, since
each imputation is drawn from a *random* instance of a normal
distribution, the mean and the variance of this distribution
being sampled from the Bayesian posterior distribution of these
parameters given the complete data and the covariates of the
incomplete data. So it is more likely that an imputed value will
be negative than that an observed value will be negative.
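
The extra dispersion can be seen in a stripped-down sketch. This is
NOT what NORM does internally (NORM works with a multivariate normal
model and covariates); it is a univariate illustration under a
standard noninformative prior, which I am assuming here: draw the
variance from its posterior, then the mean given that variance, then
the imputed value from the resulting normal distribution. The data
and the function name are hypothetical.

```python
import random
from statistics import mean, variance

def draw_imputations(observed, n_draws, rng):
    """Draw values from a normal whose mean and variance are themselves
    sampled from their posterior (noninformative prior, no covariates)."""
    n = len(observed)
    xbar = mean(observed)
    s2 = variance(observed)  # sample variance, denominator n - 1
    draws = []
    for _ in range(n_draws):
        # sigma^2 | data ~ (n - 1) * s2 / chi-square(n - 1);
        # chi-square(k) is Gamma(shape = k / 2, scale = 2).
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        # mu | sigma^2, data ~ Normal(xbar, sigma^2 / n)
        mu = rng.gauss(xbar, (sigma2 / n) ** 0.5)
        # The imputed value comes from this *random* normal distribution.
        draws.append(rng.gauss(mu, sigma2 ** 0.5))
    return draws

# Hypothetical observed values, all positive.
observed = [2.1, 3.4, 1.8, 4.0, 2.9, 3.3, 2.5, 3.8, 1.5, 3.0]
rng = random.Random(2005)
imputed = draw_imputations(observed, 20000, rng)

print(variance(observed), variance(imputed))  # imputed variance is larger
print(sum(v < 0 for v in imputed))            # count of negative imputations
```

Although none of the observed values is anywhere near zero, the wider
spread of the draws makes an occasional negative imputation quite
possible.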

It is also worth looking at the shape of the histogram of such
a variable. In many applications (though not all), this may
exhibit positive skewness which would suggest that a log-normal
distribution would be a better fit in any case. In that case,
use the logarithm of the data, which will (to within the
adequacy of fit) have a normal distribution. Run your imputations,
and then take the exponential of the results thereby transforming
back to the scale of the original variable. This result is necessarily
positive, so "anomalous" negative values simply cannot occur.
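
As a sketch of the log/exp round trip (again in Python for
illustration; in R this is just log() before and exp() after the
imputation step): the "imputation" here is deliberately simplified to
a draw from a normal fitted to the logged data, standing in for
whatever NORM would produce on the log scale. The data and the
function name are hypothetical.

```python
import math
import random
from statistics import mean, stdev

def impute_on_log_scale(observed, n_draws, rng):
    """Fit a normal to log(observed), draw on the log scale,
    then transform the draws back with exp()."""
    logs = [math.log(v) for v in observed]   # requires observed > 0
    m, s = mean(logs), stdev(logs)
    # Simplified stand-in for the real imputation step.
    log_draws = [rng.gauss(m, s) for _ in range(n_draws)]
    return [math.exp(d) for d in log_draws]

# Hypothetical positively skewed, strictly positive data.
observed = [0.4, 0.7, 0.9, 1.1, 1.6, 2.3, 3.8, 7.5]
rng = random.Random(1)
imputed = impute_on_log_scale(observed, 1000, rng)

print(min(imputed))  # always above zero, whatever was drawn
```

Note that this only works if every observed value is strictly
positive; zeros would need separate handling before taking logs.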

Also, remember that a variable to which you may have very reasonably
attributed a normal distribution (because of good fit to the data)
may be intrinsically positive solely for *semantic* reasons. E.g.
it may be a measured length. God made all lengths positive, and you
and I know this. But R, and NORM, and rnorm(), and all their
friends, do not know this. Of semantics they know nothing. And the
Daemon of Randomness will see a normal distribution, and mischievously
spit negative values at you, simply because they are there ...

However, this is just general advice, though it may give you
something to think about.

Meanwhile, I will try to have a look at the dataset whose URL
you give, and see if I have any more specific comments.

I've also noted Frank Harrell's comment about aregImpute, and
will bear it in mind. Note, however, that this does not do
multiple imputation on the same lines as NORM (or the other
Schafer-derived MI packages). See ?aregImpute section "Details".
And, specifically, from the "Description":

  "The 'transcan' function creates flexible additive imputation
   models but provides only an approximation to true multiple
   imputation as the imputation models are fixed before all
   multiple imputations are drawn. This ignores variability
   caused by having to fit the imputation models. 'aregImpute'
   takes all aspects of uncertainty in the imputations into
   account by using the bootstrap to approximate the process
   of drawing predicted values from a full Bayesian predictive
   distribution."

so that the Rubin/Schafer method described above (see paragraph
about dispersion of imputed values) is not fully implemented.

Best wishes,
Ted.