[R] error in NORM lib

Roel de Jong dejongroel at gmail.com
Tue Nov 15 10:15:26 CET 2005


try MICE package :)

regards,
	Roel de Jong

(Ted Harding) wrote:
> Folks,
> 
> Leo Gürther and I have been privately discussing the
> problems with imputation using NORM which he originally
> described on 9 November. Essentially, he observed that
> many of the imputed missing values were totally absurd,
> being well out of any range comatible with the observed
> values of the variables.
> 
> After following a few false trails, we have discovered
> the reason. People interested in using NORM (and CAT and
> MIX and maybe PAN) may well be interested in this reason!
> 
> The dataset, which can be downloaded from his URL
> 
>   http://www.anicca-vijja.de/lg/dframe.Rdata
> 
> consists of a matrix with 74 columns and 200 rows.
> There are 553 missing values out of the 14800 (less
> than 4%), and the distributions of the observed values
> of the variables are well-behaved. So this should not
> be a problematic dataset.
> 
> 61 of the 74 columns have missing values (NAs) in them,
> and this is the reason why NORM fails.
> 
> Specifically, the first few lines of the code of the
> function prelim.norm() are as follows:
> 
>     if (is.vector(x)) 
>         x <- matrix(x, length(x), 1)
>     n <- nrow(x)
>     p <- ncol(x)
>     storage.mode(x) <- "double"
>     r <- 1 * is.na(x)
>     nmis <- as.integer(apply(r, 2, sum))
>     names(nmis) <- dimnames(x)[[2]]
>     mdp <- as.integer((r %*% (2^((1:ncol(x)) - 1))) + 1)
> 
> and, as can be seen from the last line, if there are
> missing values in a column with index > 31 then
> 
>   (r %*% (2^((1:ncol(x)) - 1))) + 1 >= 2^31
> 
> and then applying as.integer() to this value returns NA
> since as.integer only works for numbers no greater than
> .Machine$integer.max, normally 2^31 - 1. (Is the situation
> different for R on say 64-bit machines?)
> 
> The value of mdp[i] is a "packed" binary encoding of the
> column positions of any NAs in row i: if bit j-1 (counting
> from 0) in the binary representation of mdp[i] is 1, then
> there is an NA in column j of row i.
> 
> The vector mdp is used at various places in the NORM routines,
> and the effect on the imputations of having NAs in it, when
> the functioning of the routines depends on unpacking the
> encoding, is catastrophic. (Experiment had shown, indeed,
> that imputing with a subset of fewer than 32 columns always
> gave acceptable results).
> 
> The upshot of this is that NORM cannot be used for multiple
> imputations if there are more than 31 columns in the data
> which have NAs in them.
> 
> You could have more than 31 columns of data -- indeed Leo's
> 74 would have worked then -- if the columns are re-ordered
> so that all the columns with NAs are at the left, provided
> there are fewer than 32 with NAs. Unfortunately Leo has 61.
> 
> There is in principle no necessity to represent NA positions
> in this way, but that is how Shafer did it and it was carried
> over into R. An alternative method would simply be to have
> a 0/1 matrix of NA indicators, but the code for the NORM
> functions would have to be picked through to replace the
> unpacking of mdp -- and this includes FORTRAN routines
> (Oh dear, echoes of the "open source and R" discussion)!
> 
> So removing this limitation would not be trivial.
> 
> I have not noticed mention of the limitation in the documentation
> of the NORM functions.
> 
> Exactly the same construction of mdp, and therefore exactly the
> same problem, occurs in prelim.cat in CAT, for which I'm joint
> maintainer with Fernando Tusell, so we had better try to look
> into that! Any lessons we learn will be broadcast, so should be
> useful for NORM as well.
> 
> And, for good measure, in MIX it occurs twice over in prelim.mix:
> once in constructing mdpz for the continuous variables, and
> once in mdpw for the categorical variables. This is perhaps
> less likely in practice to cause the problem in MIX, since it
> would arise only if either there were more than 31 columns of
> continuous variables with NAs, or more than 31 of categorical
> variables; so MIXers can spread their bets.
> 
> Again, I have not noticed that the limitation is mentioned
> in the documentation of MIX; and I'm pretty sure it is not
> in the documentation of CAT!
> 
> Any suggestions or guidance from people who are familiar with
> NORM and MIX will be most welcome.
> 
> I should add that I have not looked into PAN, but would not
> be surprised if it were there as well.
> 
> I've written this explanation in consultation with Leo Gürtler,
> and he has proposed that I should publish it to the R List;
> but please consider that it is a joint effort.
> 
> Best wishes to all,
> Ted.
> 
> 
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 15-Nov-05                                       Time: 00:45:46
> ------------------------------ XFMail ------------------------------
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>




More information about the R-help mailing list