[Rd] 1954 from NA

Mon May 24 16:47:06 CEST 2021

On Mon, May 24, 2021 at 4:40 PM Bertram, Alexander via R-devel <
r-devel using r-project.org> wrote:

> Dear Adrian,
> SPSS and other packages handle this problem in a very similar way to what I
> described: they store additional metadata for each variable. You can see
> this in the way that SPSS organizes its file format: each "variable" has
> additional metadata that indicate how specific values of the variable,
> encoded as an integer or a floating point should be handled in analysis.
> Before you actually run a crosstab in SPSS, the metadata is (presumably)
> applied to the raw data to arrive at an in memory buffer on which the
> actual model is fitted, etc.
>

As far as I am aware, SAS and Stata use "very high" and "very low" values
to signal a missing value. Basically, the same solution using a different
sign bit (not creating attributes metadata, though).

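Something similar to the IEEE-754 representation for the NaN, which has all
exponent bits set (0x7ff) plus a non-zero payload (with a zero payload the
same pattern is Inf). R's NA carries 1954 in the low word:
0x7ff00000000007a2

only using some other "high" word:
0x7fe0000000000000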

If I understand this correctly, compilers are likely to mess around with
the payload from the 0x7ff0... stuff, which endangers even the most basic R
structure like a real NA.
Perhaps using a different high word such as 0x7fe would be stable, since
compilers won't confuse it with a NaN. And then any payload would be "safe"
for any specific purpose.
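
To make the bit patterns above concrete, here is a minimal sketch in base R.
The helper names bits() and from_bytes() are made up for this illustration,
and the high word of NA may differ between platforms, so only the low word is
commented on. It also shows that a 0x7fe... pattern is an ordinary finite
double as far as R is concerned, so arithmetic would not skip it unless the
code checks for it explicitly.

```r
# Hypothetical helper: big-endian hex string of the 8 bytes behind one double.
bits <- function(x) {
  paste(as.character(writeBin(x, raw(), endian = "big")), collapse = "")
}

# Hypothetical helper: build a double from 8 big-endian bytes.
from_bytes <- function(b) {
  readBin(as.raw(b), what = "double", n = 1, endian = "big")
}

bits(Inf)        # "7ff0000000000000": all-ones exponent, zero payload
bits(NA_real_)   # last 8 hex digits are 000007a2, i.e. 1954

# The alternative "high" word 0x7fe with a zero payload.
x <- from_bytes(c(0x7f, 0xe0, 0, 0, 0, 0, 0, 0))

x == 2^1023      # TRUE: exponent field 0x7fe is the largest finite exponent
is.finite(x)     # TRUE
is.na(x)         # FALSE: R does not treat it as missing
```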

Not sure how SPSS manages its internals, but if they do it that way, they
manage it in a standard procedural way. Now, since R's NA payload is at
risk, and if your solution is "good" for specific social science missing
data, would you recommend that R's creators adopt it for a regular NA...?

We're looking for a general-purpose solution that would create as little
additional work as possible for the end users. Your solution is already
implemented in the package "labelled", where the function user_na_to_na()
has to be called before doing any statistical analysis. That still requires
users to pay attention to details which the software should take care of
automatically.

Best,
Adrian

> The 20 line solution in R looks like this:
>
>
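> # Toy survey data: 999 and the two text answers are user-defined missing codes.
> df <- data.frame(q1 = c(1, 10, 50, 999), q2 = c("Yes", "No", "Don't know",
> "Interviewer napping"), stringsAsFactors = FALSE)
> attr(df$q1, 'missing') <- 999
> attr(df$q2, 'missing') <- c("Don't know", "Interviewer napping")
>
> # Replace every value listed in a column's 'missing' attribute with NA.
> excludeMissing <- function(df) {
>   for(q in names(df)) {
>     v <- df[[q]]
>     mv <- attr(v, 'missing')
>     if(!is.null(mv)) {
>       df[[q]] <- ifelse(v %in% mv, NA, v)
>     }
>   }
>   df
> }
>
> # Cross-tabulate q1 by q2; rows with NA are dropped by table()'s default.
> table(excludeMissing(df))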
>
> If you want to preserve the missing attribute when subsetting the vectors
> then you will have to take the example further by adding a class and
> `[.withMissing` functions. This might bring the whole project to a few
> hundred lines, but the rules that apply here are well defined and well
> understood, giving you a proper basis on which to build. And perhaps the
> vctrs package might make this even simpler, take a look.
>
> Best,
> Alex
>
>
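
For reference, the class-plus-`[.withMissing` idea mentioned in the quoted
message could be sketched roughly as below. This is only an illustration under
stated assumptions: the constructor withMissing(), its argument `codes`, and
the helper dropMissing() are hypothetical names, not functions from base R,
"labelled" or vctrs, and a real implementation would need more methods
(c(), print(), and so on).

```r
# Hypothetical constructor: tag a vector with its user-defined missing codes.
withMissing <- function(x, codes) {
  structure(x, missing = codes, class = "withMissing")
}

# Subsetting drops attributes by default, so re-attach them afterwards.
`[.withMissing` <- function(x, i) {
  withMissing(unclass(x)[i], attr(x, "missing"))
}

# Hypothetical helper: turn the tagged codes into ordinary NA before analysis.
dropMissing <- function(x) {
  v <- unclass(x)
  mv <- attr(v, "missing")
  attr(v, "missing") <- NULL
  v[v %in% mv] <- NA
  v
}

q1 <- withMissing(c(1, 10, 50, 999), codes = 999)
part <- q1[3:4]          # still carries the 'missing' attribute
attr(part, "missing")    # 999
dropMissing(part)        # 50 NA
```

The point of the design is that the raw codes stay in the data until the very
last step, at the cost of the user (or a wrapper) having to call the
conversion before any analysis.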
