[Rd] FW: 1954 from NA

Avi Gross avigross at verizon.net
Tue May 25 04:51:20 CEST 2021


 

Adrian,

 

Agreed. Doubling hundreds of columns of data just to get what you want is indeed a pain. There are more straightforward ways, especially if you use tidyverse packages rather than base R. Just a warning: this message is a tad long, so anyone not interested may want to skip it.

 

But there is a caution about trying to use a feature nobody wanted changed until you came along. R has all kinds of dependencies on the existing ways of looking at an NA value, such as asking is.na(SOMETHING), the many functions like mean() that handle mean(SOMETHING, na.rm=TRUE), the way ggplot graphs skip items that are NA, and so on. Any solution you come up with to enlarge the kinds of NA may break some of that, and then you will have no right to complain.

 

What does your data look like? If, for example, all the data in a column is small integers, say under a thousand, you can pick some number like 10,000 and store the NA categories as 10,000 + 1, then 10,000 + 2, and so on. THEN you have to be careful, as noted above, to remove all such values in other contexts, either by making a copy where all numbers above 10,000 are changed to an NA for the duration, or by taking subsets of the data that exclude them.
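A minimal sketch of that trick, assuming small integers and a made-up cutoff of 10,000 (the code meanings are hypothetical too):

# Sketch: sentinel values above a cutoff stand in for NA reasons.
x <- c(5L, 10001L, 42L, 10002L, 7L)    # 10001 = reason 1, 10002 = reason 2

x_clean <- ifelse(x >= 10000L, NA, x)  # strip sentinels before computing
mean(x_clean, na.rm = TRUE)            # 18

x[x >= 10000L] - 10000L                # recover the reason codes: 1 2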

 

Floating point data can be handled the same way, or by using a negative number or other tricks.

 

Character DATA can obviously use reserved words that will not occur in the rest of the DATA, such as NA*NA:1 and NA*NA:2 or whatever makes sense to you. Ideally this is something you can remove all at once when needed, perhaps with a regular expression. If you use a factor to store such a field, as is often a good idea, there are ways to force the indexes of your NA-like levels to be whatever you want, such as the first 10 or even the last 10, perhaps letting you play games when they need to be hidden, removed, or refactored into a plain NA. It adds complexity and may break in unexpected ways.
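A minimal sketch of the factor variant, with made-up reserved strings of the form NA*NA:n:

# Sketch: reserved strings forced to be the first factor levels.
ans <- c("yes", "NA*NA:1", "no", "NA*NA:2", "yes")
f <- factor(ans, levels = c("NA*NA:1", "NA*NA:2", "no", "yes"))
as.integer(f)                        # 4 1 3 2 4 -- NA-like codes are 1 and 2

ans[grepl("^NA\\*NA:", ans)] <- NA   # collapse them all to plain NA at once
ans                                  # "yes" NA "no" NA "yes"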

 

And, I shudder to say this, more modern usage allows you to turn normal vector columns into list columns. So you can replace a single column (or as many as you want) with a tuple-like column where the first part is your data, including a plain NA when needed, and the second part is something else, like a more specific reason for any items where the first is NA. Heck, you can instead store a series of Boolean entries in the list, where each entry from the second to the last encodes TRUE if it has a particular excuse, so an item can even have multiple excuses (or none). I repeat, I shudder, simply because many commonly used R functions are not expecting list columns, and you may need to call them indirectly with something that first extracts only the part needed.
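To make that concrete, here is a minimal sketch of a list column of (value, reason) pairs; the column names and reason strings are made up:

# Sketch: each entry of the list column carries a value plus a reason.
df <- data.frame(id = 1:3)
df$score <- list(list(value = 5,  reason = NULL),
                 list(value = NA, reason = "RanOutOfTime"),
                 list(value = NA, reason = "DidNotUnderstandQuestion"))

values <- sapply(df$score, function(s) s$value)  # extract the plain values
mean(values, na.rm = TRUE)                       # 5

unlist(lapply(df$score[is.na(values)], `[[`, "reason"))  # reasons for the NAs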

 

R does have some inconsistencies in how it handles things such as name tags associated with parts of a vector: some functions preserve the attributes used this way and others do not. But if you want to emulate the same tricks normally used in making factors and matrices or giving column names, you can do something like the following, which might work. My EXAMPLE below makes a vector of a dozen sequential numbers as the VALUE and hides an attribute with month names to match, then changes every third entry to NA:

 

temp <- 1:12

attr(temp, "Month") <- month.abb    # hide a parallel vector as an attribute

temp[3] <- temp[6] <- temp[9] <- temp[12] <- NA    # blank out every third entry

 

The current value of temp may look like this:

 

> temp

[1]  1  2 NA  4  5 NA  7  8 NA 10 11 NA

attr(,"Month")

[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

 

So it has months attached as PLACEHOLDERS and four NA values. To see the REASON for an NA value, note that the two vectors share the same index, so:

 

> attr(temp, "Month")[is.na(temp)]

[1] "Mar" "Jun" "Sep" "Dec"

 

The above asks what text is associated with each NA. You can use many techniques like this to find out why a particular item is NA. If you want to know why the sixth item is NA (R indexes from 1, so the index is 6):

 

> attr(temp, "Month")[6]

[1] "Jun"

 

And it can work both ways. If I now changed the above to store an NA in the Month attribute (renamed by you to Reason or something) for answered items, and entries like “RanOutOfTime” or “DidNotUnderstandQuestion” for the rest, you could search the attribute first and get the index numbers of the questions that matched, and other such things.
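A minimal sketch of that reversed version, reusing temp from above with made-up reasons:

# Sketch: the attribute now holds a reason per entry, NA where answered.
attr(temp, "Reason") <- c(NA, NA, "RanOutOfTime", NA, NA,
                          "DidNotUnderstandQuestion", NA, NA,
                          "RanOutOfTime", NA, NA, "RanOutOfTime")

which(attr(temp, "Reason") == "RanOutOfTime")   # index numbers: 3 9 12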

 

There may well be a well-crafted package that does just what I described, and perhaps some that use less space. The very rough implementation above just hides a second vector, loosely tied to the first, in a way that may be invisible to most other functionality. But it can easily have problems, as so many operations make new vectors and remove your addition. Consider just doubling the odd vector I created:

 

> temp2 <- c(temp, temp)

> temp2

[1]  1  2 NA  4  5 NA  7  8 NA 10 11 NA  1  2 NA  4  5 NA  7  8 NA 10 11 NA

 

The annotation is gone!

 

Now if you do something a tad more normal, like re-using the names() feature, maybe you can preserve it in more cases:

 

temp <- 1:12

names(temp) <- month.abb

temp[3] <- temp[6] <- temp[9] <- temp[12] <- NA

 

> temp

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 

  1   2  NA   4   5  NA   7   8  NA  10  11  NA

 

Now, NAMES used this way can sometimes be preserved. For example, some functions have an argument for it:

 

> temp2 <- c(temp, temp, use.names=TRUE)

> temp2

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 

  1   2  NA   4   5  NA   7   8  NA  10  11  NA   1   2  NA   4   5  NA   7   8  NA  10  11  NA

 

So, it may well be that you can play such games with your input, but doing that for hundreds of columns is a fair bit of work, though it can be automated easily enough if all the columns are similar, such as repeats of data in a time series. As noted, R functions that read in DATA expect all items in a column to be of the same underlying type or NA. If your data has text giving a REASON, and you know exactly which reasons are allowed, with any remaining values left as they are, you might do something like the following.

 

Say column_orig looks like this: 5, 1, bad, 2, worse, 1, 2, 5, bad, 6, worse, missing, 2

 

Your stuff may be read in as CHARACTER and look like:

 

> column_orig

[1] "5"       "1"       "bad"     "2"       "worse"   "1"       "2"       "5"       "bad"    

[10] "6"       "worse"   "missing" "2"

 

So, you can process the above with something like ifelse() to make a temporary version. Do this VERY carefully, as ifelse() does not preserve name attributes!

 

> names.temp <- ifelse(column_orig %in% c("bad", "worse", "missing"), column_orig, NA)

> column_orig <- ifelse(column_orig %in% c("bad", "worse", "missing"), NA, column_orig)

> column_orig <- as.numeric(column_orig)

> names(column_orig) <- names.temp

 

> column_orig

<NA>    <NA>     bad    <NA>   worse    <NA>    <NA>    <NA>     bad    <NA>   worse missing    <NA> 

  5       1      NA       2      NA       1       2       5      NA       6      NA      NA       2

 

(The above may not show up formatted right in the email, but it shows the names on the first line and the data on the second. Wherever the data is NA, the reason is in the name.)
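If all the columns follow the same pattern, a minimal sketch of automating those four steps over a whole data.frame (the reason strings are the made-up ones above, and df is assumed to hold all-character columns as read in):

# Sketch: wrap the recipe in a function and apply it to every column.
split_reasons <- function(x, reasons = c("bad", "worse", "missing")) {
  nm  <- ifelse(x %in% reasons, x, NA)           # keep the reasons as names
  out <- as.numeric(ifelse(x %in% reasons, NA, x))
  names(out) <- nm
  out
}
df[] <- lapply(df, split_reasons)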

 

Again, I am just playing with your specified need and pointing out ways R may partially support it, though probably far from ideally, as you are trying to do something R was probably never designed for. I suspect the philosophy behind using a tibble instead of a data.frame may preserve your meta-info better.

 

But if all you want is to know the reason for a missing observation while using little space, there may be other ways to consider, such as making a sparse matrix alongside the original data if missing values are rare enough. Sure, it might have 600 columns and umpteen rows, but you can store a small integer, or even a byte, in each entry and perhaps skip any row that has nothing missing. If you later need the info, and the data has not been scrambled, such as by removing rows or columns or by sorting, you can easily find it. Or, if you simply add one more column with some form of unique sequence number or ID and maintain it, you can always index back to find what you want, WITHOUT all the warnings mentioned above.
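A minimal sketch of the sparse idea using the Matrix package, with made-up dimensions and integer reason codes:

# Sketch: a sparse matrix of reason codes parallel to the real data
# (1 = "bad", 2 = "worse", 3 = "missing"; positions must stay stable).
library(Matrix)
reasons <- sparseMatrix(i = c(3, 6, 9), j = c(1, 1, 2),
                        x = c(1, 2, 3), dims = c(1000, 600))
reasons[3, 1]    # 1, i.e. "bad"; the untouched cells cost almost nothing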

 

If memory is a huge concern, consider ways to massage your original data to conserve what you need, save THAT to a file on disk, and remove the extra objects so the space can be garbage collected. When and IF you ever need that info at some later date, the form you chose can be read back in. But you need to be careful, as such meta-info is lost unless you use a method that conserves it. Do not save it as a CSV file, for example, but as something R writes and can read back in the same way.
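For example, the serialized RDS format round-trips attributes where CSV would drop them; a minimal sketch, with a made-up file name:

# Sketch: save the annotated vector, free the memory, restore it intact.
saveRDS(column_orig, "column_orig.rds")    # names (the reasons) survive
rm(column_orig)                            # let gc() reclaim the space
column_orig <- readRDS("column_orig.rds")  # read back later, names and all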

 

Or, you can try to make your own twists on changing how NA works and take lots of risks, since you would not be doing something published and guaranteed.

 

I think I can now politely bow out of this topic and wish you luck with whatever you choose. It may even be using something other than R!

 

From: Adrian Dușa <dusa.adrian at unibuc.ro> 
Sent: Monday, May 24, 2021 5:26 AM
To: Avi Gross <avigross at verizon.net>
Cc: r-devel <r-devel at r-project.org>
Subject: Re: [Rd] 1954 from NA

 

Hmm...

If it were only one column, then your solution would be neat. But with 500-600 variables, each of which can contain multiple missing values, doubling the number of variables just to describe NA values seems excessive to me.

Not to mention that we should be able to quickly convert / import / export from one software package to another. This would imply maintaining some sort of metadata reference recording which additional explanatory factor describes which original variable.

 

All of this strikes me as a lot of hassle compared to storing some information within a tagged NA value... I just need a few more bits to play with.

 

Best wishes,

Adrian

 

On Sun, May 23, 2021 at 10:21 PM Avi Gross via R-devel <r-devel at r-project.org> wrote:

Arguably, R was not developed with needs like these in mind.

When I have had to work with datasets from some of the social sciences, I have had to adapt to subtleties in how they did things with software like SPSS, in which an NA was recorded using an out-of-bounds marker like 999 or "." or even a blank cell. The problem is that R normally stores data such as integers or floating-point numbers not as text but in their own formats, and a vector by definition can only contain ONE data type. So the various forms of NA, as well as NaN and Inf, had to be grafted on to be considered VALID to share the same storage as if they sort of were an integer or floating-point number or text or whatever.

It does strike me as possible to simply have a column that is something like a factor and can contain as many NA excuses as you wish, from "NOT ANSWERED" to "CANNOT READ THE SQUIGGLE" to "NOT SURE" to "WILL BE FILLED IN LATER" to "I DON'T SPEAK ENGLISH AND CANNOT ANSWER STUPID QUESTIONS". This additional column would presumably only have content when the other column has an NA. Your queries and other changes would work on something like a data.frame where both such columns coexist.
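A minimal sketch of that pairing, with made-up labels:

# Sketch: a factor column of excuses alongside the data column.
df <- data.frame(answer = c(4, NA, 7, NA),
                 why_na = factor(c(NA, "NOT ANSWERED", NA, "NOT SURE")))
table(df$why_na[is.na(df$answer)])    # tabulate the reasons for missingness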

Note that reading in data with multiple NA reasons may take extra work. If your error codes are text, the whole column will come in as text. If the errors are 999, 998, and 997, it may all be treated as numeric, and you may not want to convert all such codes to NA immediately. Rather, you would use the first vector/column to make the second vector and THEN replace everything that should be an NA with an actual NA and reparse the entire vector so it becomes properly numeric, unless you like working with text and will convert to numbers as needed on the fly.

Now, this form of annotation may not be pleasing, but I suggest that an implementation that does allow annotation would use up space too. Of course, if your NA values are rare and space is only used for them, you might save space. And if you make the extra column a factor, which uses the smallest integer type it can as a basis, that may be another way to save on space.

People who have done work with R, especially those using the tidyverse, are quite used to using one column to explain another. So if you are asked to tabulate, say, what percent of missing values are due to reasons A/B/C, the added column works fine for that calculation too.

 

-- 

Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania

https://adriandusa.eu

