[Rd] table(exclude = NULL) always includes NA

Martin Maechler maechler at stat.math.ethz.ch
Wed Aug 10 19:39:52 CEST 2016


>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Tue, 9 Aug 2016 15:35:41 +0200 writes:

>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>>     on Sun, 7 Aug 2016 15:32:19 +0000 writes:

> > This is an example from https://stat.ethz.ch/pipermail/r-help/2007-May/132573.html .
> 
> > With R 2.7.2:
> 
> > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> > > table(a, b, exclude = NULL)
> >       b
> > a      1 2
> >   1    1 1
> >   2    2 0
> >   3    1 0
> >   <NA> 1 0
> 
> > With R 3.3.1:
> 
> > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> > > table(a, b, exclude = NULL)
> >       b
> > a      1 2 <NA>
> >   1    1 1    0
> >   2    2 0    0
> >   3    1 0    0
> >   <NA> 1 0    0
> > > table(a, b, useNA = "ifany")
> >       b
> > a      1 2
> >   1    1 1
> >   2    2 0
> >   3    1 0
> >   <NA> 1 0
> > > table(a, b, exclude = NULL, useNA = "ifany")
> >       b
> > a      1 2 <NA>
> >   1    1 1    0
> >   2    2 0    0
> >   3    1 0    0
> >   <NA> 1 0    0
> 
> > For the example, in R 3.3.1, the result of 'table' with
> > exclude = NULL includes NA even if NA is not present. This
> > differs from R 2.7.2, where the result comes from factor(exclude = NULL),
> > which includes NA only if NA is present.
> 
> I agree that this (R 3.3.1 behavior) seems undesirable and looks
> wrong, and the old (<= 2.7.2) behavior for table(a,b,
> exclude=NULL) seems desirable to me.
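
The factor() behaviour referred to above can be checked directly. This is a minimal sketch relying only on base R's factor(); the names 'with_na' and 'without_na' are made up for illustration.

```r
# factor(exclude = NULL) keeps NA as a level only when NA occurs in the data
with_na    <- factor(c(1, 1, 2, 2, NA, 3), exclude = NULL)
without_na <- factor(c(2, 1, 1, 1, 1, 1),  exclude = NULL)

levels(with_na)      # an NA level in addition to "1", "2", "3"
levels(without_na)   # only "1" and "2"

anyNA(levels(with_na))     # TRUE
anyNA(levels(without_na))  # FALSE
```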
> 
> 
> > From R 3.3.1 help on 'table', in "Details" section:
> > 'useNA' controls if the table includes counts of 'NA' values: the allowed values correspond to never, only if the count is positive and even for zero counts.  This is overridden by specifying 'exclude = NULL'.
> 
> > Specifying 'exclude = NULL' overrides 'useNA' to what value? The documentation doesn't say. Looking at the code of function 'table', the value is "always".
> 
> Yes, it should be documented what happens for this case,
> (but read on ...)

and it is *not* true that the documentation does not say: since
2013, it has contained

 exclude: levels to remove for all factors in ‘...’.  If set to ‘NULL’,
          it implies ‘useNA = "always"’.  See ‘Details’ for its
          interpretation for non-factor arguments.
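
To make the two relevant 'useNA' settings concrete, here is a minimal sketch using the vectors of the example. 'useNA' is passed explicitly and 'exclude' is left unspecified, so the result should not depend on how 'exclude = NULL' is resolved; the names 't_ifany' and 't_always' are only illustrative.

```r
a <- c(1, 1, 2, 2, NA, 3)
b <- c(2, 1, 1, 1, 1, 1)

t_ifany  <- table(a, b, useNA = "ifany")   # NA row for 'a' only; 'b' has no NA
t_always <- table(a, b, useNA = "always")  # NA row and NA column, even if all zero

dim(t_ifany)    # 4 2
dim(t_always)   # 4 3
t_always[, 3]   # the all-zero <NA> column for 'b'
```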


> > For the example, in R 3.3.1, the result like in R 2.7.2 can be obtained with useNA = "ifany" and 'exclude' unspecified.
> 
> Yes.  What should we do?
> I currently think that we'd want to change the line
> 
>      useNA <- if (!missing(exclude) && is.null(exclude)) "always"
> 
> to
> 
>      useNA <- if (!missing(exclude) && is.null(exclude)) "ifany" # was "always"
> 
> 
> which would not even contradict documentation, as indeed you
> mentioned above, the exact action here had not been documented.

The last part ("which ..") above is wrong, as mentioned earlier.

The above change entails behaviour which looks better to me;
however, the change *is* "against the current documentation",
and after experimentation (a "complete factorial design" of
argument settings), I'm not entirely happy with the result; one reason
is that   'exclude = NULL'  and  (e.g.)   'exclude = c()'
are (still) handled differently: From a usual interpretation,
both should mean 
  "do not exclude any factor entries (and levels) from tabulation"
but one of the two changes the default of 'useNA' and the other
does not.   If we want a change anyway (and have to update the doc),
it could be "more logical"  to replace the line above by

   useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"

notably, replacing 'useNA' *only* if it has not been specified,
which seems much closer to "typically expected" behavior.
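
The three candidate rules can be compared side by side in a small sketch. The functions 'rule_331', 'rule_ifany' and 'rule_new' are hypothetical stand-ins for the single 'useNA <- ...' line, not the code of table() itself, and the 'else useNA' fallback is an assumption. character(0) is used as the zero-length, non-NULL value (an assumption about the case meant above), since a literal c() evaluates to NULL in R.

```r
# Hypothetical stand-ins for the 'useNA' resolution step inside table();
# the 'else useNA' fallback is assumed, it is not shown in the lines quoted above.
rule_331 <- function(exclude, useNA = "no") {
  if (!missing(exclude) && is.null(exclude)) "always" else useNA
}
rule_ifany <- function(exclude, useNA = "no") {
  if (!missing(exclude) && is.null(exclude)) "ifany" else useNA
}
rule_new <- function(exclude, useNA = "no") {
  if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always" else useNA
}

rule_331(exclude = NULL)                    # "always"
rule_331(exclude = character(0))            # "no": zero-length, but not NULL
rule_331(exclude = NULL, useNA = "ifany")   # "always": explicit useNA is overridden
rule_ifany(exclude = NULL)                  # "ifany"

rule_new(exclude = NULL)                    # "always"
rule_new(exclude = character(0))            # "always": same as for NULL
rule_new(exclude = NULL, useNA = "ifany")   # "ifany": explicit useNA is kept
rule_new(exclude = NA)                      # "no": NA is among the excluded values
```

Only the last rule treats the two "exclude nothing" spellings alike and leaves an explicitly supplied 'useNA' untouched.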



> The change above at least does not break any of the standard R
> tests ('make check-all', i.e., including the recommended
> packages), which for me confirms that it may be "what is
> best"...
> 
> ----
> 
> Thank you for mentioning the important consequence for summary(<logical>).
> It can help give insight into what a "probably best" behavior should
> be for these cases of table().
> 
> Martin Maechler,
> ETH Zurich
> 
> > The result of 'summary' of a logical vector is affected. As mentioned in http://stackoverflow.com/questions/26775501/r-dropping-nas-in-logical-column-levels , in the code of function 'summary.default', for logical, table(object, exclude = NULL) is used.
> 
> > With R 2.7.2:
> 
> > > log <- c(NA, logical(4), NA, !logical(2), NA)
> > > summary(log)
> >    Mode   FALSE    TRUE    NA's
> > logical       4       2       3
> > > summary(log[!is.na(log)])
> >    Mode   FALSE    TRUE
> > logical       4       2
> > > summary(TRUE)
> >    Mode    TRUE
> > logical       1
> 
> > With R 3.3.1:
> 
> > > log <- c(NA, logical(4), NA, !logical(2), NA)
> > > summary(log)
> >    Mode   FALSE    TRUE    NA's
> > logical       4       2       3
> > > summary(log[!is.na(log)])
> >    Mode   FALSE    TRUE    NA's
> > logical       4       2       0
> > > summary(TRUE)
> >    Mode    TRUE    NA's
> > logical       1       0
> 
> > In R 3.3.1, "NA's" is always in the result of 'summary' of a logical vector. It is unlike 'summary' of a numeric vector.
> > On the other hand, in R 3.3.1, FALSE is not in the result of 'summary' of a logical vector that doesn't contain FALSE.
> 
> > I prefer the result of 'summary' of a logical vector like in R 2.7.2, or, alternatively, the result that always includes all possible values: FALSE, TRUE, NA.
> 
> I tend to agree, and strongly prefer the 'R(<=2.7.2)'-behavior
> for table() {and hence summary(<logical>)}.
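
The two alternatives preferred above can be sketched as follows. The first uses table() with an explicit useNA = "ifany"; the second uses 'count_logical', a hypothetical helper (not part of base R, and not what summary.default does) that always reports FALSE, TRUE and NA. The vector is named 'lg' here to avoid masking log().

```r
lg <- c(NA, logical(4), NA, !logical(2), NA)

# NA reported only when present (the older behaviour)
table(lg, useNA = "ifany")
table(lg[!is.na(lg)], useNA = "ifany")

# Hypothetical helper: always report all three possible values
count_logical <- function(x) {
  c("FALSE" = sum(!x, na.rm = TRUE),
    "TRUE"  = sum(x,  na.rm = TRUE),
    "NA's"  = sum(is.na(x)))
}
count_logical(lg)     # FALSE 4, TRUE 2, NA's 3
count_logical(TRUE)   # FALSE 0, TRUE 1, NA's 0
```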


