[R] Impact of multiple imputation on correlations

Joshua Wiley jwiley.psych at gmail.com
Mon Aug 1 18:52:16 CEST 2011


Hi Tina,

That is quite a bit of missingness, especially considering the sample
size is not large to begin with.  This would make me treat *any*
result cautiously.  That said, if you have a reasonable idea of the
mechanism causing the missingness, or if, from additional variables in
your study, you can model the missing data mechanism well enough that
you are confident (for some definition of confident) that the
missingness is random after conditioning on your model (Rubin calls
this missing at random, MAR), then you are in a reasonable place to
use MI and draw inferences from the results.
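A first step is simply to look at the observed missingness patterns
before imputing.  As a rough sketch in R (the data frame name `dat`,
the number of imputations, and the seed are all illustrative, not from
your study):

```r
# Sketch: inspect the missingness pattern, then impute with mice.
# `dat` is a hypothetical data frame holding the study variables.
library(mice)

md.pattern(dat)                    # tabulates which combinations of
                                   # variables are jointly missing
imp <- mice(dat, m = 20, seed = 1) # multiple imputation; m and seed
                                   # here are arbitrary choices
```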

Even if you are uncertain about this, it is *not* any better to just
say, "well, there was too much missing data for me to feel safe using
MI, so here is the correlation based just on the observed data".  That
_will be biased_ unless the missing data mechanism is completely
random, even unconditional on anything else in your study (missing
completely at random, MCAR; for example, if participants flipped coins
to decide which questions to respond to).

When averaging correlations, it is conventional to average the inverse
hyperbolic tangents of the correlations and then use the hyperbolic
tangent to transform the averaged value back to the original scale
(also known as Fisher's Z transformation).  The mice package may do
this automatically if there is a function to compute pooled
correlations.
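If there is no ready-made pooling function, the transformation is easy
to do by hand.  A minimal sketch, assuming `imp` is a mids object from
mice() and `v1`/`v2` are hypothetical column names standing in for
your two variables:

```r
# Sketch: pool a correlation across imputations via Fisher's Z.
# Assumes `imp` is a mids object from mice() and the data contain
# columns v1 and v2 (hypothetical names).
library(mice)

rs <- sapply(seq_len(imp$m), function(i) {
  d <- complete(imp, i)   # the i-th completed (imputed) dataset
  cor(d$v1, d$v2)
})

z <- atanh(rs)            # Fisher's Z (inverse hyperbolic tangent)
r_pooled <- tanh(mean(z)) # average on the Z scale, back-transform
r_pooled
```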

How much results differ between simply deleting cases with any
unobserved value (listwise deletion) and using MI varies.  There may
be no difference, a larger difference, or a smaller one.

Looking at the scatter plot matrix from the different imputations, I
do not know that I would actually classify that as varying quite a
bit.  I realize the sign of the slope changes some, but that is not
too surprising because all of them are somewhat close to flat.  You
can compare the between imputation variance to the within imputation
variance (I think mice gives you this information).
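In mice, that comparison might look roughly like the following
(assuming `imp` is a mids object and, purely for illustration, that
the relationship of interest is a regression of `v2` on `v1`; the
exact names of the variance components in the output vary by mice
version):

```r
# Sketch: fit the same model in each imputed dataset, then pool with
# Rubin's rules; the pooled output includes within- and
# between-imputation variance components.
library(mice)

fit <- with(imp, lm(v2 ~ v1))  # one fit per imputed dataset
pooled <- pool(fit)
summary(pooled)                # pooled estimates and standard errors
pooled                         # printing the pool object shows the
                               # between/within variance components and
                               # the fraction of missing information
```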

I partly addressed your last question at the beginning---I would
certainly not trust the correlation obtained simply by deleting cases
with missingness, but I also would not trust the result obtained using
MI unless it was well set up.  Although you have shown us some of the
data, you have not mentioned how you modelled the missingness.  This
can have a substantial impact on your results (and also their
trustworthiness).  mice provides a number of different imputation
models, and if you collected many variables in your study, you have a
choice of which ones to use as predictors.
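Both choices can be controlled explicitly.  A sketch, again with a
hypothetical data frame `dat` and illustrative settings:

```r
# Sketch: control the imputation method and the predictor set.
# `dat` is a hypothetical data frame; "pmm" (predictive mean matching)
# and m = 20 are illustrative choices, not recommendations.
library(mice)

imp <- mice(dat,
            method = "pmm",                   # same method for all variables
            predictorMatrix = quickpred(dat), # auto-select predictors by
                                              # their correlation with each
                                              # target variable
            m = 20)
```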

Given all of this, I would suggest finding a local statistician or
consultant to talk with about this.  Your questions are more
statistical than R-related.  Also, in addition to learning
more about MI (there are several good books and articles on it that
you can look up or email me offlist and I can provide references if
you want), someone who is there can be more helpful because they will
have access to your whole dataset and can work with you to find the
best variables/model to model the missing data mechanism.

I hope this helps and good luck,

Josh


On Mon, Aug 1, 2011 at 12:03 AM,  <lifty.gere at gmx.de> wrote:
> Dear all,
>
> I have been attempting to use multiple imputation (MI) to handle missing data in my study. I use the mice package in R for this. The deeper I get into this process, the more I realize I first need to understand some basic concepts which I hope you can help me with.
>
> For example, let us consider two arbitrary variables in my study that have the following missingness pattern:
>
> Variable 1 available, Variable 2 available: 51 (of 118 observations, 43%)
> Variable 1 available, Variable 2 missing: 37 (31.3%)
> Variable 1 missing, Variable 2 available: 10 (8.4%)
> Variable 1 missing, Variable 2 missing: 20 (16.9%)
>
> I am interested in the correlation between Variable 1 and Variable 2.
>
> Q1. Does it even make sense for me to use MI (or anything else, really) to replace my missing data when such large fractions are not available?
>
> Plot 1 (http://imgur.com/KFV9y&CmV1sl) provides a scatter plot of these example variables in the original data. The correlation coefficient r = -0.34 and p = 0.016.
>
> Q2. I notice that correlations between variables in imputed data (pooled estimates over all imputations) are much lower and less significant than the correlations in the original data. For this example, the pooled estimates for the imputed data show r = -0.11 and p = 0.22.
>
> Since this seems to happen in all the variable combinations that I have looked at, I would like to know if MI is known to have this behavior, or whether this is specific to my imputation.
>
> Q3. When going through the imputations, the distribution of the individual variables (min, max, mean, etc.) matches the original data. However, correlations and least-square line fits vary quite a bit from imputation to imputation (see Plot 2, http://imgur.com/KFV9yl&CmV1s). Is this normal?
>
> Q4. Since my results differ (quite significantly) between the original and imputed data, which one should I trust?
>
> Thank you for your help in advance.
> Tina
> --
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/
