[Rd] 1954 from NA

Adrian Dușa dusa.adrian at unibuc.ro
Mon May 24 11:26:12 CEST 2021


Hmm...
If it were only one column, your solution would be neat. But with 500-600
variables, each of which can contain multiple missing values, doubling the
number of variables just to describe NA values seems excessive to me.
Not to mention that we should be able to quickly convert / import / export
data from one software package to another, which would imply maintaining
some sort of metadata reference recording which additional explanatory
factor describes which original variable.

All of this strikes me as a lot of hassle compared to storing that
information within a tagged NA value... I just need a few more bits to
play with.
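For reference, the "1954" in the subject line is visible directly from base R: NA_real_ is an IEEE NaN whose low 32-bit payload word carries the integer 1954, and those spare payload bits are exactly the space a tagged NA would use. A quick sketch to inspect them (assuming the usual IEEE 754 double layout):

```r
# NA_real_ is a quiet NaN whose 64-bit pattern carries the integer 1954
# in one 32-bit word (the other word is 0x7FF00000).
bits  <- writeBin(NA_real_, raw())        # the 8 raw bytes of NA_real_
words <- readBin(bits, "integer", n = 2)  # two 32-bit words, native endianness
1954 %in% words                           # TRUE: the NA payload
```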

Best wishes,
Adrian

On Sun, May 23, 2021 at 10:21 PM Avi Gross via R-devel <
r-devel at r-project.org> wrote:

> Arguably, R was not developed to satisfy some of these needs in the
> intended way.
>
> When I have had to work with datasets from some of the social sciences, I
> have had to adapt to subtleties in how things were done with software like
> SPSS, in which an NA was encoded using an out-of-bounds marker like 999 or
> "." or even a blank cell. The problem is that R normally stores data such
> as integers or floating-point numbers not as text but in their own formats,
> and a vector by definition can contain only ONE data type. So the various
> forms of NA, as well as NaN and Inf, had to be grafted on to be considered
> VALID and to share the same storage area as if they sort of were an integer
> or floating-point number or text or whatever.
>
> It does strike me as possible to simply have a column, something like a
> factor, that can contain as many NA excuses as you wish, from "NOT
> ANSWERED" to "CANNOT READ THE SQUIGGLE" to "NOT SURE" to "WILL BE FILLED
> IN LATER" to "I DON'T SPEAK ENGLISH AND CANNOT ANSWER STUPID QUESTIONS".
> This additional column would presumably have content only when the other
> column has an NA. Your queries and other changes would work on something
> like a data.frame where both such columns coexisted.
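A minimal sketch of that layout in R (the column names and reason labels are just illustrative):

```r
# Companion-column layout: 'answer' holds the data, 'na_reason' is a
# factor that has content only where 'answer' is NA.
answer    <- c(4, NA, 7, NA, 2)
na_reason <- factor(c(NA, "NOT ANSWERED", NA, "NOT SURE", NA))
d <- data.frame(answer, na_reason)
subset(d, is.na(answer))   # the rows that carry an NA excuse
```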
>
> Note that reading in data with multiple NA reasons may take extra work. If
> your error codes are text, the whole column will be read as text. If the
> codes are 999 and 998 and 997, the column may be treated as numeric, and
> you may not want to convert all such codes to NA immediately. Rather, you
> would use the first vector/column to build the second vector, THEN replace
> everything that should be an NA with an actual NA, and reparse the entire
> vector so it becomes properly numeric, unless you like working with text
> and will convert to numbers as needed on the fly.
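That two-step recoding might look like this in R (the code values and reason labels are hypothetical):

```r
# Raw import where 999/998/997 are out-of-bounds NA markers.
raw_col <- c(3, 999, 5, 998, 997)
codes   <- c(`999` = "NOT ANSWERED", `998` = "NOT SURE", `997` = "LATER")

# First build the reason column from the codes...
na_reason <- factor(codes[as.character(raw_col)], levels = unname(codes))
# ...THEN replace the codes with real NAs in the data column.
value <- replace(raw_col, raw_col %in% as.numeric(names(codes)), NA)
```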
>
> Now, this form of annotation may not be pleasing, but I suggest that an
> implementation that does allow annotation may use up space too. Of course,
> if your NA values are rare and space is used only for them, you might save
> space. But if you make the factor column use the smallest int it can as
> its basis, that may be a way to save on space as well.
>
> People who have done work with R, especially those using the tidyverse,
> are quite used to using one column to explain another. So if you are asked
> to tabulate, say, what percent of missing values are due to reasons A/B/C,
> then the added column works fine for that calculation too.
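With the companion column in place, that tabulation is a one-liner (reason labels illustrative):

```r
# Share of missing values per recorded reason; table() drops the NAs,
# i.e. the rows whose value was actually observed.
na_reason <- factor(c(NA, "A", NA, "B", "A"))
round(prop.table(table(na_reason)) * 100, 1)   # A: 66.7%, B: 33.3%
```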
>

-- 
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu



