[R] Dealing with NA in "tbl_df"?

Sun Mar 22 00:31:41 CET 2015

Greetings.  I was reading through the vignette for "tidy-data" (from the
"tidyr" package) and came across something that puzzled me.

One of the examples in the vignette uses a data set related to tuberculosis,
originally from the World Health Organization, but also available at:

  https://github.com/hadley/tidy-data/blob/master/data/tb.csv

Here's the code:

+++++

> library(dplyr)  #### for tbl_df
> library(tidyr)  #### for gather
> tb <- tbl_df(read.csv("tb.csv", stringsAsFactors=FALSE))

> tb2 <- tb %>%
+     gather(demo, n, -iso2, -year, na.rm=TRUE)

> str(tb2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 35750 obs. of  4 variables:
 $ iso2: chr  "AD" "AD" "AD" "AE" ...
 $ year: int  2005 2006 2008 2006 2007 2008 2007 2005 2006 2007 ...
 $ demo: Factor w/ 20 levels "m04","m514","m014",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ n   : int  0 0 0 0 0 0 0 0 1 0 ...
>

-----

I thought it might be interesting to see how to do this using the "reshape2"
package.  Here's the code for that:

+++++

> library(reshape2)

> tb2a <- tb %>%
+     melt(
+         id.vars=c("iso2", "year"),
+         variable.name="demo",
+         value.name="n",
+         na.rm=TRUE)
> tb2a <- tbl_df(tb2a)

> str(tb2a)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 35750 obs. of  4 variables:
 $ iso2: chr  "AD" "AD" "AD" "AE" ...
 $ year: int  2005 2006 2008 2006 2007 2008 2007 2005 2006 2007 ...
 $ demo: Factor w/ 20 levels "m04","m514","m014",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ n   : int  0 0 0 0 0 0 0 0 1 0 ...
>

-----

The "str" results make it appear that I'm on the right track, but it's always
good to double check:

+++++

> all.equal(tb2, tb2a)
[1] "Rows in x but not y: 34659, 34658, 34656, 34655, 34651, 34650, 34649,
34648, 34647, 34646, 32264[...]Rows in y but not x: 35663, 34658, 34657,
34656, 34655, 34652, 34651, 34650, 34649, 32265, 32264[...]"
>

-----

Hmm.  Not what I'd hoped for, and none of the simple visual checks I tried
showed any differences.  After a little trial and error, I found the place where
the results first differ:

+++++

> ROWS <- 2356
> all.equal(tb2[1:ROWS, ], tb2a[1:ROWS, ])
[1] TRUE
> ROWS <- 2357
> all.equal(tb2[1:ROWS, ], tb2a[1:ROWS, ])
[1] "Rows in x but not y: 2357Rows in y but not x: 2357"

-----
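
The trial-and-error search above can be automated.  What follows is only a
sketch: "first_bad_prefix" is a hypothetical helper name, and it assumes both
objects have the same number of rows and that "all.equal" on the first k rows
fails for every k at or beyond the first differing row.  Under those
assumptions a binary search finds the shortest failing prefix.  It is shown
here on plain data frames; it has not been run on the "tbl_df" objects above.

+++++

# Hypothetical helper: length of the shortest prefix on which x and y differ.
first_bad_prefix <- function(x, y,
                             same = function(a, b) isTRUE(all.equal(a, b))) {
    stopifnot(nrow(x) == nrow(y))
    if (same(x, y)) return(NA_integer_)   # no difference anywhere
    lo <- 0L          # longest prefix known to compare equal
    hi <- nrow(x)     # shortest prefix known to differ
    while (hi - lo > 1L) {
        mid <- (lo + hi) %/% 2L
        rows <- seq_len(mid)
        if (same(x[rows, , drop = FALSE], y[rows, , drop = FALSE])) {
            lo <- mid
        } else {
            hi <- mid
        }
    }
    hi
}

# Small illustration with made-up data: the frames differ only in row 7.
x <- data.frame(a = 1:10)
y <- x
y$a[7] <- 99L
first_bad_prefix(x, y)   # 7

-----

The "same" argument is there so that a different comparison can be swapped in
without touching the search itself.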

OK, let's have a look at the spot where things go off the rails:

+++++

> tb2[2357, ]
Source: local data frame [1 x 4]

  iso2 year demo n
1   NA 1995 m014 0
> tb2a[2357, ]
Source: local data frame [1 x 4]

  iso2 year demo n
1   NA 1995 m014 0
>

-----

Things certainly *look* the same, but:

+++++

> all.equal(tb2[2357, ], tb2a[2357, ])
[1] "Rows in x but not y: 1Rows in y but not x: 1"
>

-----

If you guessed that it's the NA that's the source of the problem, you're
evidently correct:

+++++

> head(which(is.na(tb2[ , "iso2"])))
[1] 2357 2358 2359 2360 2361 2362
>

-----
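
A possible origin of these NAs, offered as an assumption rather than something
checked against tb.csv: "NA" is the ISO 3166-1 two-letter code for Namibia, and
"read.csv" by default treats the string "NA" as a missing value.  The sketch
below uses made-up inline data (not the real file) to show the effect, and one
way to restore the code afterwards.

+++++

# Made-up data: the first row's iso2 is the literal country code "NA".
csv_text <- "iso2,year,m04\nNA,1995,0\nAD,2005,NA\n"
d <- read.csv(text = csv_text, stringsAsFactors = FALSE)

is.na(d$iso2)   # TRUE FALSE: the country code was read as missing

# Assumption: every missing iso2 really is the code "NA".  If some are
# genuinely missing, this replacement would mislabel them.
d$iso2[is.na(d$iso2)] <- "NA"

-----

The count column is left alone, so its genuinely missing value stays NA.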

But I don't understand what the problem is.  The "all.equal" function does
appear to deal appropriately with NAs in plain vectors.  Here's a trivial example:

+++++

> library(pryr)

Attaching package: ‘pryr’

The following object is masked from ‘package:dplyr’:

    %.%

> foo <- c(3, NA, 7)
> bar <- c(3, NA, 7)
> address(foo)  #### note that foo and bar are distinct objects
[1] "0x422c278"
> address(bar)
[1] "0x4953188"
> all.equal(foo, bar)  #### but they're still equal, even with NA
[1] TRUE
>

-----
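
For contrast, here is a short sketch of how the usual base R comparisons treat
NA in plain vectors.  It uses the same foo and bar as above.  Element-wise
"==" propagates NA, while "identical" and "all.equal" compare the objects as a
whole and treat matching NAs as equal.

+++++

foo <- c(3, NA, 7)
bar <- c(3, NA, 7)

foo == bar                    # TRUE NA TRUE
all(foo == bar)               # NA, not TRUE
identical(foo, bar)           # TRUE
isTRUE(all.equal(foo, bar))   # TRUE

-----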

And just to be sure, I checked that there really are NAs in foo and bar:

+++++

> any(is.na(foo))
[1] TRUE
> any(is.na(bar))
[1] TRUE
>

-----

It finally occurred to me to strip off the extra classes ("tbl_df" and "tbl")
by converting both objects to plain data frames, and then do the comparison:

+++++

> all.equal(data.frame(tb2), data.frame(tb2a))
[1] TRUE
>

-----
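
One reading of this result, stated as an assumption: "all.equal" is an S3
generic, so which comparison runs depends on the class of its first argument.
The "Rows in x but not y" wording is not what base R prints for data frames,
which suggests that a method supplied for "tbl_df" objects did the earlier
comparisons.  The sketch below shows the dispatch mechanism only.  The class
"demo_tbl" and its method are hypothetical stand-ins, not dplyr's code.

+++++

# A data frame carrying an extra, made-up class in front of "data.frame".
x <- structure(
    data.frame(id = c("a", NA), stringsAsFactors = FALSE),
    class = c("demo_tbl", "data.frame"))

# Hypothetical method: always reports a difference, to make dispatch visible.
all.equal.demo_tbl <- function(target, current, ...) {
    "demo_tbl method called"
}

with_class <- all.equal(x, x)   # uses the demo_tbl method

# as.data.frame() drops the classes in front of "data.frame", so the
# base method for data frames is used instead.
plain <- as.data.frame(x)
class(plain)                    # "data.frame"
without_class <- all.equal(plain, plain)

-----

Here "with_class" holds the message from the made-up method, while
"without_class" is TRUE, even though the underlying data never changed.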

So this is evidently a "solution" to the problem, but I don't know what the
moral of the story is.  If you have any insights, please pass 'em along.

Thanks.

-- Mike