[Rd] [R] Why does R replace all row values with NAs

Tue Mar 3 23:17:00 CET 2015

Stephanie,

Actually, it's as.logical that isn't preserving matrix dimensions, because
it coerces to a logical vector:

> x <- matrix(sample(c(NA_integer_, 1:100), 500, replace=TRUE), nrow=50)
> dim(as.logical(x))
NULL

~G

On Tue, Mar 3, 2015 at 2:09 PM, Stephanie M. Gogarten <
sdmorris at u.washington.edu> wrote:

>
>
> On 3/3/15 1:26 PM, Hervé Pagès wrote:
>
>>
>>
>> On 03/03/2015 02:28 AM, Martin Maechler wrote:
>>
>>> Diverted from R-help :
>>> .... as it gets into musing about new R language "primitives"
>>>
>>>  William Dunlap <wdunlap at tibco.com>
>>>>>>>>      on Fri, 27 Feb 2015 08:04:36 -0800 writes:
>>>>>>>>
>>>>>>>
>>>      > You could define functions like
>>>
>>>      > is.true <- function(x) !is.na(x) & x
>>>      > is.false <- function(x) !is.na(x) & !x
>>>
>>>      > and use them in your selections.  E.g.,
>>>      >> x <- data.frame(a=1:10,b=2:11,c=c(1,NA,3,NA,5,NA,7,NA,NA,10))
>>>      >> x[is.true(x$c >= 6), ]
>>>      > a  b  c
>>>      > 7   7  8  7
>>>      > 10 10 11 10
>>>
>>>      > Bill Dunlap
>>>      > TIBCO Software
>>>      > wdunlap tibco.com
>>>
>>> Yes; the Matrix package has had these
>>>
>>> is0  <- function(x) !is.na(x) & x == 0
>>> isN0 <- function(x)  is.na(x) | x != 0
>>> is1  <- function(x) !is.na(x) & x   # also == "isTRUE componentwise"
>>>
>>
>> Note that using %in% to block propagation of NAs is about 2x faster:
>>
>>  > x <- sample(c(NA_integer_, 1:10000), 500000, replace=TRUE)
>>  > microbenchmark(as.logical(x) %in% TRUE, !is.na(x) & x)
>> Unit: milliseconds
>>                      expr       min        lq      mean   median        uq
>>   as.logical(x) %in% TRUE  6.034744  6.264382  6.999083  6.29488  6.346028
>>             !is.na(x) & x 11.202808 11.402437 11.469101 11.44848
>> 11.517576
>>        max neval
>>   40.36472 100
>>   11.90916   100
>>
>
> Unfortunately %in% does not preserve matrix dimensions:
>
> > x <- matrix(sample(c(NA_integer_, 1:100), 500, replace=TRUE), nrow=50)
> > dim(x)
> [1] 50 10
> > dim(!is.na(x) & x)
> [1] 50 10
> > dim(as.logical(x) %in% TRUE)
> NULL
>
> Stephanie
>
>
>
>>
>>
>>
>>> namespace hidden for a while  [note the comment of the last one!]
>>> and using them for readibility in its own code.
>>>
>>> Maybe we should (again) consider providing some versions of
>>> these with R ?
>>>
>>> The Matrix package also has had fast
>>>
>>> allFalse <- all0 <- function(x) .Call(R_all0, x)
>>> anyFalse <- any0 <- function(x) .Call(R_any0, x)
>>> ##
>>> ## anyFalse <- function(x) isTRUE(any(!x))         ## ~= any0
>>> ## any0 <- function(x) isTRUE(any(x == 0))          ## ~= anyFalse
>>>
>>> namespace hidden as well, already, which probably could also be
>>> brought to base R.
>>>
>>> One big reason to *not* go there (to internal C code) at all with R is
>>> that
>>> S3 and S4 dispatch for '==' ('!=', etc, the 'Compare' group generics)
>>> and 'is.na() have been known and package writers have
>>> programmed methods for these.
>>> To ensure that S3 and S4 dispatch works "correctly" also inside
>>> such new internals is much less easily achieved, and so
>>> such a C-based internal function  is0() would no longer be
>>> equivalent with    !is.na(x) & x == 0
>>> as soon as 'x' is an "object" with a '==', 'Compare' and/or an is.na()
>>> method.
>>>
>>
>> Excellent point. Thank you! It really makes a big difference for
>> developers who maintain a complex hierarchy of S4 classes and methods,
>> when functions like is.true, anyFalse, etc..., which can be expressed in
>> terms of more basic operations like ==, !=, !, is.na, etc..., just work
>> out-of-the-box on objects for which these basic operations are defined.
>>
>> There is conceptually a small set of "building blocks", at least for
>> objects with a vector-like or list-like semantic, that can be used
>> to formally describe the semantic of many functions in base R. This
>> is what the man page for anyNA does by saying:
>>
>>    anyNA implements any(is.na(x))
>>
>> even though the actual implementation differs, but that's ok, as long
>> as anyNA is equivalent to doing any(is.na(x)) on any object for which
>> building block is.na() is implemented.
>>
>> Unfortunately there is no clearly identified set of building blocks
>> in base R. For example, if I want the comparison operations to work
>> on my object, I need to implement ==, >, <, !=, <=, and >= (the
>> 'Compare' group generics) even though it should be enough to implement
>> == and >=, because all the others can be described in terms of these
>> 2 building blocks. unique/duplicated is another example (unique(x) is
>> conceptually x[!duplicated(x)]). And so on...
>>
>> Cheers,
>> H.
>>
>>
>>> OTOH, simple R versions such as your  'is.true',  called 'is1'
>>> inside Matrix maybe optimizable a bit by the byte compiler (and
>>> jit and other such tricks) and still keep the full
>>> semantic including correct method dispatch.
>>>
>>> Martin Maechler, ETH Zurich
>>>
>>>
>>>      > On Fri, Feb 27, 2015 at 7:27 AM, Dimitri Liakhovitski <
>>>      > dimitri.liakhovitski at gmail.com> wrote:
>>>
>>>      >> Thank you very much, Duncan.
>>>      >> All this being said:
>>>      >>
>>>      >> What would you say is the most elegant and most safe way to
>>> solve such
>>>      >> a seemingly simple task?
>>>      >>
>>>      >> Thank you!
>>>      >>
>>>      >> On Fri, Feb 27, 2015 at 10:02 AM, Duncan Murdoch
>>>      >> <murdoch.duncan at gmail.com> wrote:
>>>      >> > On 27/02/2015 9:49 AM, Dimitri Liakhovitski wrote:
>>>      >> >> So, Duncan, do I understand you correctly:
>>>      >> >>
>>>      >> >> When I use x$x<6, R doesn't know if it's TRUE or FALSE, so
>>> it returns
>>>      >> >> a logical value of NA.
>>>      >> >
>>>      >> > Yes, when x$x is NA.  (Though I think you meant x$c.)
>>>      >> >
>>>      >> >> When this logical value is applied to a row, the R says:
>>> hell, I don't
>>>      >> >> know if I should keep it or not, so, just in case, I am
>>> going to keep
>>>      >> >> it, but I'll replace all the values in this row with NAs?
>>>      >> >
>>>      >> > Yes.  Indexing with a logical NA is probably a mistake, and
>>> this is one
>>>      >> > way to signal it without actually triggering a warning or
>>> error.
>>>      >> >
>>>      >> > BTW, I should have mentioned that the example where you
>>> indexed using
>>>      >> > -which(x$c>=6) is a bad idea:  if none of the entries were 6
>>> or more,
>>>      >> > this would be indexing with an empty vector, and you'd get
>>> nothing, not
>>>      >> > everything.
>>>      >> >
>>>      >> > Duncan Murdoch
>>>      >> >
>>>      >> >
>>>      >> >>
>>>      >> >> On Fri, Feb 27, 2015 at 9:13 AM, Duncan Murdoch
>>>      >> >> <murdoch.duncan at gmail.com> wrote:
>>>      >> >>> On 27/02/2015 9:04 AM, Dimitri Liakhovitski wrote:
>>>      >> >>>> I know how to get the output I need, but I would benefit
>>> from an
>>>      >> >>>> explanation why R behaves the way it does.
>>>      >> >>>>
>>>      >> >>>> # I have a data frame x:
>>>      >> >>>> x = data.frame(a=1:10,b=2:11,c=c(
>>> 1,NA,3,NA,5,NA,7,NA,NA,10))
>>>      >> >>>> x
>>>      >> >>>> # I want to toss rows in x that contain values >=6. But I
>>> don't want
>>>      >> >>>> to toss my NAs there.
>>>      >> >>>>
>>>      >> >>>> subset(x,c<6) # Works correctly, but removes NAs in c,
>>> understand why
>>>      >> >>>> x[which(x$c<6),] # Works correctly, but removes NAs in c,
>>> understand
>>>      >> why
>>>      >> >>>> x[-which(x$c>=6),] # output I need
>>>      >> >>>>
>>>      >> >>>> # Here is my question: why does the following line
>>> replace the values
>>>      >> >>>> of all rows that contain an NA # in x$c with NAs?
>>>      >> >>>>
>>>      >> >>>> x[x$c<6,]  # Leaves rows with c=NA, but makes the whole
>>> row an NA.
>>>      >> Why???
>>>      >> >>>> x[(x$c<6) | is.na(x$c),] # output I need - I have to be
>>>      >> super-explicit
>>>      >> >>>>
>>>      >> >>>> Thank you very much!
>>>      >> >>>
>>>      >> >>> Most of your examples (except the ones using which()) are
>>> doing logical
>>>      >> >>> indexing.  In logical indexing, TRUE keeps a line, FALSE
>>> drops the
>>>      >> line,
>>>      >> >>> and NA returns NA.  Since "x$c < 6" is NA if x$c is NA,
>>> you get the
>>>      >> >>> third kind of indexing.
>>>      >> >>>
>>>      >> >>> Your last example works because in the cases where x$c is
>>> NA, it
>>>      >> >>> evaluates NA | TRUE, and that evaluates to TRUE.  In the
>>> cases where
>>>      >> x$c
>>>      >> >>> is not NA, you get x$c < 6 | FALSE, and that's the same as
>>> x$c < 6,
>>>      >> >>> which will be either TRUE or FALSE.
>>>      >> >>>
>>>      >> >>> Duncan Murdoch
>>>      >> >>>
>>>      >> >>
>>>      >> >>
>>>      >> >>
>>>      >> >
>>>      >>
>>>      >>
>>>      >>
>>>      >> --
>>>      >> Dimitri Liakhovitski
>>>      >>
>>>      >> ______________________________________________
>>>      >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
>>> see
>>>      >> https://stat.ethz.ch/mailman/listinfo/r-help
>>>      >> PLEASE do read the posting guide
>>>      >> http://www.R-project.org/posting-guide.html
>>>      >> and provide commented, minimal, self-contained, reproducible
>>> code.
>>>      >>
>>>
>>>      > [[alternative HTML version deleted]]
>>>
>>>      > ______________________________________________
>>>      > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>      > https://stat.ethz.ch/mailman/listinfo/r-help
>>>      > PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>>      > and provide commented, minimal, self-contained, reproducible code.
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>>
>>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Gabriel Becker, PhD
Computational Biologist
Bioinformatics and Computational Biology
Genentech, Inc.

	[[alternative HTML version deleted]]