[R] patterns of missing data: determining monotonicity

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Fri Jan 7 15:13:20 CET 2005


On 06-Jan-05 Michael Friendly wrote:
> Here is a problem that perhaps someone out here has an idea
> about. It vaguely reminds me of something I've seen before,
> but can't place.  Can anyone help?
> 
> For multiple imputation, there are simpler methods available
> if  the patterns of missing data are 'monotone' --- if Vj is
> missing then all variables Vk, k>j are also missing, vs. more 
> complex methods required when the patterns are not monotone.
> The problem is to determine if, for a collection of variables,
> there is an ordering of them with a monotone missing data pattern,
> or, if not, what the longest monotone sequence is.
> 
> Here is an example, where in a dataset of 65 observations, there
> are 8 different patterns of missingness, with X and . representing
> observed and missing:
> 
> Group   V2   V3   V4   V5   V6   V7   V8   V9   V10   V11   nmiss
>   1     X    X    X    X    X    X    X    X     X     X      0  
>   2     X    X    X    X    X    X    .    X     X     X      1  
>   3     X    X    X    X    X    .    X    X     X     X      1  
>   4     X    X    X    X    X    .    .    X     X     X      2  
>   5     X    X    .    X    .    X    X    X     X     X      2  
>   6     X    X    .    .    X    X    X    X     X     X      2  
>   7     X    X    .    .    X    .    X    X     X     X      3  
>   8     X    X    .    .    .    X    X    X     X     X      3  
> 
> Treated as a binary matrix, one can sort the columns by the number
> of non-missing for each variable, and monotone means that there
> are at most 2 runs -- a string of 0s followed by all 1s for *all*
> patterns. But how to determine an ordering (or orderings) of
> variables of maximal length?
> 
> Group   V2   V3   V9   V10   V11   V6   V8   V5   V7   V4   nmiss
>   1      0    0    0    0     0     0    0    0    0    0     0  
>   2      0    0    0    0     0     0    1    0    0    0     1  
>   3      0    0    0    0     0     0    0    0    1    0     1  
>   4      0    0    0    0     0     0    1    0    1    0     2  
>   5      0    0    0    0     0     1    0    0    0    1     2  
>   6      0    0    0    0     0     0    0    1    0    1     2  
>   7      0    0    0    0     0     0    0    1    1    1     3  
>   8      0    0    0    0     0     1    0    1    0    1     3  
>         ==   ==   ==   ===   ===   ==   ==   ==   ==   ==
>          0    0    0    0     0     2    2    3    3    4        

Hi Michael,

Consider the following approach. It's not a full solution to
the specific problem you have posed above, but it contains
pathways to solutions.

If you're doing multiple imputation anyway, you should install
the packages "cat" (for categorical data), "norm" (for continuous
data, assumed Normal) and "mix" (for data mixing both kinds),
and also "pan" for MI on "panel" data, which might also be useful
to you.

I'll discuss the situation using "cat" as an example, though
"norm" works the same way as far as this question is concerned.

First make sure your data are arranged as a matrix X (say)
with rows representing "cases" and columns variables. If the
variables are categorical, make sure that their values are
represented as integers 1, 2, 3, ... (don't start with "0"),
and represent missing values as NA.
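
As a sketch of that preparation, assuming the raw data sit in a data
frame of factors (the name "raw" and its columns are made up for
illustration), data.matrix() replaces each factor by its integer
codes, which start at 1, and leaves NA as NA:

```r
# Hypothetical raw data: two categorical variables with some values missing.
raw <- data.frame(
  a = factor(c("lo", "hi", NA, "mid")),
  b = factor(c("yes", NA, "no", "yes"))
)

# data.matrix() turns each factor into its internal integer codes (1, 2, ...),
# giving a matrix with cases in rows and variables in columns.
X <- data.matrix(raw)

# Sanity checks: codes start at 1, and the NAs have been kept.
min(X, na.rm = TRUE)
sum(is.na(X))
```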

Example of data matrix X:

  X
        [,1] [,2] [,3]
   [1,]    3    1    2
   [2,]    2    1    3
   [3,]    2    1   NA
   [4,]    2    3   NA
   [5,]    1    3   NA
   [6,]    2   NA   NA
   [7,]    2   NA   NA
   [8,]    3   NA   NA
   [9,]   NA   NA   NA
  [10,]   NA   NA   NA

(constructed to have a monotone pattern). Now shuffle it:

  X<-X[,sample(1:3)]
  X<-X[sample(1:10),]

  X
        [,1] [,2] [,3]
   [1,]    1    2    3
   [2,]    3   NA    2
   [3,]    1   NA    2
   [4,]    3   NA    1
   [5,]    1    3    2
   [6,]   NA   NA   NA
   [7,]   NA   NA    2
   [8,]   NA   NA    3
   [9,]   NA   NA    2
  [10,]   NA   NA   NA

Consider this as a real data matrix, where it is no longer
obvious that it has a monotone missingness pattern. Then:

  library(cat)
  s <- prelim.cat(X)

Now read *very* carefully:

  ?prelim.cat

and in particular what is said about its value (the value of s).
Note also what is *not* said about it!

Now look at "s" by printing it to the console. Amongst its 17
components the following are of particular interest.

  s$x
        [,1] [,2] [,3]
   [1,]    1    3    2
   [2,]    1    2    3
   [3,]    3   NA    1
   [4,]    1   NA    2
   [5,]    3   NA    2
   [6,]   NA   NA    2
   [7,]   NA   NA    2
   [8,]   NA   NA    3
   [9,]   NA   NA   NA
  [10,]   NA   NA   NA

You can see that this is the same as X except that rows have
been permuted to push the NAs downwards. The component

   s$ro
   [1]  2  5  4  3  1  9  6  8  7 10

shows the permutation: the original row 1 of X is row 2 of s$x,
the original row 2 of X is row 5 of s$x, and so on.
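
One way to convince yourself of this, using nothing beyond the
printed output, is to type in the two matrices and the permutation
and check that indexing the sorted rows by the permutation gives
back X. This is only a check on the numbers in this example; the
names sx and ro are local stand-ins for s$x and s$ro:

```r
# The shuffled data matrix X, typed in from the listing above.
X <- matrix(c(
   1,  2,  3,
   3, NA,  2,
   1, NA,  2,
   3, NA,  1,
   1,  3,  2,
  NA, NA, NA,
  NA, NA,  2,
  NA, NA,  3,
  NA, NA,  2,
  NA, NA, NA), ncol = 3, byrow = TRUE)

# The row-sorted matrix and the permutation, as printed for s$x and s$ro.
sx <- matrix(c(
   1,  3,  2,
   1,  2,  3,
   3, NA,  1,
   1, NA,  2,
   3, NA,  2,
  NA, NA,  2,
  NA, NA,  2,
  NA, NA,  3,
  NA, NA, NA,
  NA, NA, NA), ncol = 3, byrow = TRUE)
ro <- c(2, 5, 4, 3, 1, 9, 6, 8, 7, 10)

# Row i of X sits at row ro[i] of sx, so sx[ro, ] restores the original order.
identical(sx[ro, ], X)
```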

Now look at the component s$nmis of s:

   s$nmis
   [1] 5 8 2

This gives the numbers of missing values in the different
columns of X (and of s$x since the order of columns has not
been changed).
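
The same counts can be cross-checked in base R, without "cat", by
counting the NAs in each column. A minimal sketch, with X typed in
from the shuffled example above:

```r
# The shuffled data matrix X from the example.
X <- matrix(c(
   1,  2,  3,
   3, NA,  2,
   1, NA,  2,
   3, NA,  1,
   1,  3,  2,
  NA, NA, NA,
  NA, NA,  2,
  NA, NA,  3,
  NA, NA,  2,
  NA, NA, NA), ncol = 3, byrow = TRUE)

# is.na(X) is a logical matrix; its column sums are the per-column NA counts.
nmis <- colSums(is.na(X))
nmis
```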

Now you can sort s$nmis into increasing order (fewest missing
first) using the "index.return=TRUE" option of 'sort' so as to
get the column permutation:

  sort(s$nmis,index.return=TRUE)
  $x
  [1] 2 5 8

  $ix
  [1] 3 1 2

You can check by hand that s$x[,c(3,1,2)] is in monotone
pattern; in one step, you can get X re-structured into
monotone pattern as

  s$x[,sort(s$nmis,index.return=TRUE)$ix]
        [,1] [,2] [,3]
   [1,]    2    1    3
   [2,]    3    1    2
   [3,]    1    3   NA
   [4,]    2    1   NA
   [5,]    2    3   NA
   [6,]    2   NA   NA
   [7,]    2   NA   NA
   [8,]    3   NA   NA
   [9,]   NA   NA   NA
  [10,]   NA   NA   NA
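
The same ordering, and a check that the result really is monotone,
can be sketched in base R. The helper is_monotone() below is a
made-up name for illustration, not part of "cat": it tests that in
every row, once a value is missing, everything to its right is
missing too. Since a monotone pattern means that the columns' sets
of missing rows are nested, ordering the columns by their NA counts
should find a monotone arrangement whenever one exists.

```r
# The shuffled data matrix X from the example.
X <- matrix(c(
   1,  2,  3,
   3, NA,  2,
   1, NA,  2,
   3, NA,  1,
   1,  3,  2,
  NA, NA, NA,
  NA, NA,  2,
  NA, NA,  3,
  NA, NA,  2,
  NA, NA, NA), ncol = 3, byrow = TRUE)

# Illustrative helper: TRUE if, in every row, the missingness indicator
# never goes from "missing" back to "observed" when read left to right.
is_monotone <- function(M) {
  miss <- is.na(M)
  all(apply(miss, 1, function(r) all(diff(as.integer(r)) >= 0)))
}

# Column order with the fewest NAs first; order() gives the same
# permutation as sort(..., index.return=TRUE)$ix.
ord <- order(colSums(is.na(X)))

is_monotone(X)          # columns as they stand
is_monotone(X[, ord])   # columns re-ordered by NA count
```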

I hope this is some help. At least it shows you places where
you can start digging. If the original X is incompatible with
monotone pattern, then the above should give you something
which is close to monotone, though I'm not sure whether it
will get you "as close as possible"; and you may need to
do some more work to uncover how to determine your "longest
monotone sequence".
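
For the "longest monotone sequence", one possible line of attack
(a sketch only, not something taken from the MI packages) is to
note that a set of variables can be arranged monotonically exactly
when their sets of missing rows are nested. After sorting the
columns by number missing, the longest nested chain can be found
by a simple dynamic programme, much like a longest-increasing-
subsequence search. The function name longest_monotone() and the
0/1 matrix R (typed in from the eight patterns in your example,
1 = missing) are illustrative:

```r
# Missingness indicators for the eight quoted patterns (1 = missing).
# Rows are pattern groups 1..8; columns are V2..V11.
vars <- paste0("V", 2:11)
R <- matrix(0L, nrow = 8, ncol = 10, dimnames = list(1:8, vars))
R[c(5, 6, 7, 8), "V4"] <- 1L
R[c(6, 7, 8),    "V5"] <- 1L
R[c(5, 8),       "V6"] <- 1L
R[c(3, 4, 7),    "V7"] <- 1L
R[c(2, 4),       "V8"] <- 1L

# Illustrative helper: longest run of variables whose missing sets are nested.
longest_monotone <- function(R) {
  R    <- R[, order(colSums(R)), drop = FALSE]  # fewest missing first
  p    <- ncol(R)
  len  <- rep(1L, p)            # len[j]: longest chain ending at column j
  prev <- rep(NA_integer_, p)   # prev[j]: predecessor of j in that chain
  for (j in seq_len(p)) {
    for (i in seq_len(j - 1L)) {
      # Column i may precede column j if, wherever i is missing, j is too.
      if (all(R[, i] <= R[, j]) && len[i] + 1L > len[j]) {
        len[j]  <- len[i] + 1L
        prev[j] <- i
      }
    }
  }
  # Walk back from the end of the longest chain to recover its members.
  j <- which.max(len)
  chain <- integer(0)
  while (!is.na(j)) {
    chain <- c(j, chain)
    j <- prev[j]
  }
  colnames(R)[chain]
}

longest_monotone(R)
```

On these eight patterns this returns a sequence of 7 variables
(the five complete ones, then V6 and V4); V5 could stand in for
V6, so the longest sequence is not unique.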

In any case, since these MI packages (all based on Schafer's
original S code) work internally with monotonicity in mind,
for reasons of efficiency and fast convergence, you may find
that your imputation needs are met by them.

Best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 07-Jan-05                                       Time: 14:13:20
------------------------------ XFMail ------------------------------



