[Rd] Apparent bug in summary.data.frame() with columns of Date class and NA's present

Tue Feb 9 00:15:47 CET 2016

Thanks Marc,

It used to be the case that the NA count was stored as a 7th element of the Fivenum+mean summary. This had the side effect that the NAs were displayed using the same format as the other numbers, which was sort of OK for numerics (3.00) but not for class Date (1970-01-04 for three missings???), so Date objects didn't display the NA count at all. This got straightened out at some point, but apparently there's a residual. It could well be that the reason is that we still have

> length(summary(as.Date(0, origin="1970-1-1")))
[1] 6
> length(summary(as.Date(NA, origin="1970-1-1")))
[1] 6

whereas other numerics get one element longer in case of NAs

> length(summary(as.integer(0)))
[1] 6
> length(summary(as.integer(NA)))
[1] 7

but probably the person who fixed it the last time (can't figure out who that was at the moment) needs to have a look.
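
To illustrate why that length difference could matter, here is a minimal sketch that uses hand-built stand-in vectors rather than the real summary methods. The names s_default, s_date and z are made up for the illustration; the attribute-moving step and the 'nr' line are copied from the code quoted below. It only shows the mechanism and is not a claim about where a fix belongs.

```r
# Stand-in for a summary.default()-style result on a vector with three NAs:
# the NA count is an ordinary 7th element.
s_default <- c("Min." = 1, "1st Qu." = 2, "Median" = 3, "Mean" = 3,
               "3rd Qu." = 4, "Max." = 5, "NA's" = 3)

# Stand-in for what summary.Date() does (per the quoted code): drop the
# "NA's" element and keep the count as an attribute instead.
s_date <- s_default
m <- match("NA's", names(s_date), 0L)
if (m) {
    NAs <- as.integer(s_date[m])
    s_date <- s_date[-m]
    attr(s_date, "NAs") <- NAs
}

length(s_default)    # 7: count is part of the vector
length(s_date)       # 6: count lives only in the attribute
attr(s_date, "NAs")  # 3

# summary.data.frame() sizes its result from the per-column lengths, so a
# data frame holding only the Date-style column gets 6 rows and no NA row.
z <- list(Col1 = s_date)
nr <- max(unlist(lapply(z, NROW)))
nr                   # 6
```

If any other column contributes a 7-element summary, the maximum becomes 7, which would match the observation that the Date NA count shows up only when another column has NAs.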

-pd

> On 08 Feb 2016, at 23:03 , Marc Schwartz <marc_schwartz at me.com> wrote:
> 
> Hi all,
> 
> Based upon an exchange with Göran Broström on R-Help today:
> 
>  https://stat.ethz.ch/pipermail/r-help/2016-February/435992.html
> 
> there appears to be a bug in summary.data.frame() in the case where a data frame contains Date class columns that contain NA's and other columns, if present, do not.
> 
> Example, modified from R-Help:
> 
> x <- c(18000000, 18810924, 19091227, 19027233, 19310526, 19691228, NA)
> x.Date <- as.Date(as.character(x), format = "%Y%m%d")
> 
> DF.Dates <- data.frame(Col1 = x.Date)
> 
>> summary(x.Date)
>        Min.      1st Qu.       Median         Mean      3rd Qu. 
> "1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17" 
>        Max.         NA's 
> "1969-12-28"          "3" 
> 
> 
> # NA's missing from output
>> summary(DF.Dates)
>      Col1           
> Min.   :1881-09-24  
> 1st Qu.:1902-12-04  
> Median :1920-09-10  
> Mean   :1923-04-12  
> 3rd Qu.:1941-01-17  
> Max.   :1969-12-28  
> 
> 
> DF.Dates$x1 <- 1:7
> 
>> DF.Dates
>        Col1 x1
> 1       <NA>  1
> 2 1881-09-24  2
> 3 1909-12-27  3
> 4       <NA>  4
> 5 1931-05-26  5
> 6 1969-12-28  6
> 7       <NA>  7
> 
> # NA's still missing
>> summary(DF.Dates)
>      Col1                  x1     
> Min.   :1881-09-24   Min.   :1.0  
> 1st Qu.:1902-12-04   1st Qu.:2.5  
> Median :1920-09-10   Median :4.0  
> Mean   :1923-04-12   Mean   :4.0  
> 3rd Qu.:1941-01-17   3rd Qu.:5.5  
> Max.   :1969-12-28   Max.   :7.0  
> 
> 
> DF.Dates$x2 <- c(1:6, NA)
> 
> # NA's show if another column has any
>> summary(DF.Dates)
>      Col1                  x1            x2      
> Min.   :1881-09-24   Min.   :1.0   Min.   :1.00  
> 1st Qu.:1902-12-04   1st Qu.:2.5   1st Qu.:2.25  
> Median :1920-09-10   Median :4.0   Median :3.50  
> Mean   :1923-04-12   Mean   :4.0   Mean   :3.50  
> 3rd Qu.:1941-01-17   3rd Qu.:5.5   3rd Qu.:4.75  
> Max.   :1969-12-28   Max.   :7.0   Max.   :6.00  
> NA's   :3                          NA's   :1     
> 
> 
> The behavior appears to occur because summary.Date() assigns an "NAs" attribute internally that contains the count of NA's in the source Date vector:
> 
> x <- summary.default(unclass(object), digits = digits, ...)
> if (m <- match("NA's", names(x), 0)) {
>       NAs <- as.integer(x[m])
>       x <- x[-m]
>       attr(x, "NAs") <- NAs
>   }
> 
> rather than the count being retained as an actual element in the result vector, as in summary.default():
> 
>       nas <- is.na(object)
>       object <- object[!nas]
>       qq <- stats::quantile(object)
>       qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)
>       names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", 
>           "Max.")
>       if (any(nas)) 
>           c(qq, `NA's` = sum(nas))
>       else qq
> 
> 
> This results in an apparent (but not real) error in the value of the variable 'nr' within summary.data.frame(), which is used to set the length of the result created within that function:
> 
>   nr <- if (nv) 
>       max(unlist(lapply(z, NROW)))
>   else 0
> 
> 'nr' is used later in the function to set the length of the initial result vector 'sms', which in turn is assigned back to the result list 'z'.
> 
> In the case of my example above, where the NA's are not printed, 'nr' is 6 rather than 7. 6 is correct, since that is the actual length of the result vector from summary.Date(), even though the printed output of the result would appear to contain 7 elements, including the NA count, because of the behavior of print.summaryDefault().
> 
> This results in an apparent truncation of the result, with a loss of the "NAs" attribute from summary.Date(), when the result is returned by summary.data.frame().
> 
> If the source vector is numeric, as per my example above, then 'nr' is set to 7 when NA's are present and the result is correctly printed along with the other columns.
> 
> The history of the difference in the manner in which the NA counts are stored in the different summary() methods is not clear to me, so I am unsure how best to approach a resolution.
> 
> At least three options seem possible and I have not fully thought through the implications of each yet:
> 
> 1. Modify the code that creates and uses 'nr' in summary.data.frame(), to account for the NAs attribute from summary.Date().
> 2. Restore the NAs attribute later in the code, if present in the vector that results from summary.Date().
> 3. Modify the code in summary.Date() so that it mimics the approach in summary.default() relative to storing the NA count.
> 
> It is important to note that summary.POSIXct() has code similar to summary.Date() relative to the handling of NA's.
> 
> In addition, print.summaryDefault() contains checks for both Date and POSIXct classes and outputs accordingly. So the inter-dependencies of the handling of NA's across the methods are notable.
> 
> Thus, since there are likely to be other implications for the choice of resolution that I am not considering here, and I am likely to be missing some nuances, I defer to others for comments/corrections.
> 
> Thanks and regards,
> 
> Marc Schwartz
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com