[Rd] Surprising behavior of letters[c(NA, NA)]

Sat Dec 18 23:16:58 CET 2010

I'm agnostic at this point about the recycling rules
for logical subscripting, but I've been coming around
to thinking that x[logicalSubscript] should only return
the values of x such that the corresponding value
of logicalSubscript is TRUE.  Values in logicalSubscript
of NA and FALSE should be treated the same: the
corresponding value of x should not be put into the
output subset.  I know this change will break low-level
tests, but I suspect that, when logical NA's are used in
subscripts, it will make more user-written code start
working than it will break.
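
A minimal sketch of the difference, for illustration only.  The names
x, cond, current and proposed are hypothetical, and the "proposed"
line merely emulates the suggested rule with existing operators; it is
not how R behaves today.

```r
x <- c(10, 20, 30, 40)
cond <- c(TRUE, NA, FALSE, TRUE)

# Current behaviour: an NA in a logical subscript yields an NA element.
current <- x[cond]

# Emulation of the suggested rule: NA is treated like FALSE.
proposed <- x[!is.na(cond) & cond]

print(current)    # 10 NA 40
print(proposed)   # 10 40
```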

I've asked people why they use the idiom
    x[which(condition)]
instead of the simpler
    x[condition]
Some new users don't know the simpler one works, but
more experienced users say they use which() because
it treats NA's the same as FALSE's.
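
A small sketch of why the which() idiom is attractive, using a
hypothetical vector x that contains a missing value:

```r
x <- c(5, NA, 7, 2)
condition <- x > 4              # TRUE NA TRUE FALSE

direct <- x[condition]          # the NA in condition is kept as an NA element
via_which <- x[which(condition)]  # which() returns only the TRUE positions

print(direct)      # 5 NA 7
print(via_which)   # 5 7
```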

I've heard the same response when I ask about using
subset(dataFrame,condition) instead of dataFrame[condition,].
subset() uses non-standard evaluation rules that
lead to convoluted code involving substitute() and the
like when you want to use it in a general function,
but people use it in part because it treats logical NA's
as though they were FALSE's.
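
The data frame case can be sketched the same way.  The data frame df
and its columns are made up for illustration:

```r
df <- data.frame(id = 1:4, score = c(90, NA, 55, 70))

# Bracket subscripting keeps a row for the NA condition, filled with NAs.
by_bracket <- df[df$score > 60, ]

# subset() drops rows where the condition is NA, as if it were FALSE.
by_subset <- subset(df, score > 60)

print(nrow(by_bracket))   # 3
print(nrow(by_subset))    # 2
```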

I sometimes use an is.true() function in subscript expressions
   is.true <- function(x) !is.na(x) & x
   vec[is.true(condition)]  
but it seems like a waste of time (and it gets confused
with the isTRUE() function).
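
To illustrate the possible confusion: is.true() as defined above is
elementwise, while base R's isTRUE() answers a single question about
its whole argument.  The vector condition here is a hypothetical
example.

```r
is.true <- function(x) !is.na(x) & x

condition <- c(TRUE, NA, FALSE, TRUE)

elementwise <- is.true(condition)   # one logical per element, never NA
whole <- isTRUE(condition)          # a single FALSE: not a length-one TRUE

print(elementwise)    # TRUE FALSE FALSE TRUE
print(whole)          # FALSE
print(isTRUE(TRUE))   # TRUE
```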

A separate but related point is that using logical NA as a
subscript to a vector with names gives a nonoptimal result:
  > c(one=1,two=2,three=3,four=4)[c(TRUE,NA,FALSE,TRUE)]
   one <NA> four 
     1   NA    4  
Why isn't the second element of the result called "two"?
As it stands we only know that there was an NA in the subscript,
somewhere between the first and fourth element.

An unrelated point concerns the builtin constants NA, NA_integer_,
NA_real_, etc, where all modes of NA's are generally printed as
just NA.  This seems to lead to misleading tests, like
   x[NA] # integer or logical NA?
but when analyzing data I think you rarely use a typed-in NA
in an expression.  The NA's generally come from data you have
read in from an external source (where NA is rarely used to
indicate missing values) and you always have to make sure
the data columns of imported data have the expected types.
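
A short sketch of how the type of a typed-in NA changes the result,
even though both print as NA.  The vector x is a hypothetical example:

```r
x <- 1:3

print(typeof(NA))            # "logical"
print(typeof(NA_integer_))   # "integer"

logical_na <- x[NA]           # logical subscript, recycled to length(x)
integer_na <- x[NA_integer_]  # integer subscript, selects one element

print(length(logical_na))    # 3
print(length(integer_na))    # 1
```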

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> -----Original Message-----
> From: r-devel-bounces at r-project.org 
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Duncan Murdoch
> Sent: Saturday, December 18, 2010 9:24 AM
> To: Radford Neal
> Cc: r-devel at r-project.org
> Subject: Re: [Rd] Surprising behavior of letters[c(NA, NA)]
> 
> On 18/12/2010 9:12 AM, Radford Neal wrote:
> > Duncan Murdoch writes:
> >
> >    The relevant quote is in the Language Definition, talking about
> >    indices by type of index:
> >
> >    "Logical. The indexing i should generally have the same length as
> >    x. If it is shorter, then its elements will be recycled as
> >    discussed in Section 3.3 [Elementary arithmetic operations], page
> >    14. If it is longer, then x is conceptually extended with NAs.
> >    The selected values of x are those for which i is TRUE."
> >
> > But this certainly does not justify the actual behaviour.  It says
> > that, for example, (1:3)[NA] should not be a vector of three NAs,
> > but rather a vector of length zero - since NONE of the indexes are
> > TRUE.
> >
> > The actual behaviour of NA in a logical index makes no sense.  It
> > makes sense that NA in an integer index produces an NA in the
> > result, since this NA might correctly express the uncertainty in
> > the value at this position that follows from the uncertainty in the
> > index (and hence produce sensible results in subsequent
> > operations).  But NA in a logical index should lead to a result
> > that is of uncertain length.  However, R has no mechanism for
> > expressing such uncertainty, so it makes more sense that NA in a
> > logical index should produce an error.
> >
> 
> I agree that the behaviour is not particularly obvious, but I'm not
> so sure it should produce an error.  We should get an error when the
> input is likely to be accidental or due to a misconception and the
> output could be accepted and lead to wrong results later.  I think
> using an NA in a logical index is probably due to a misconception
> (e.g. thinking it is an NA_integer_), but the results are so weird
> that they are unlikely to pass unnoticed.
> 
> And presumably whoever chose this behaviour back in the ancient past 
> thought there was some use in including NA in a logical index, and 
> someone out there in the real world has made use of it.
> 
> But I wouldn't object if R version 3 gave errors for logical index 
> vectors that were the wrong length or that contained NAs.
> 
> Duncan Murdoch
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>