[R] Checking for invalid dates: Code works but needs improvement

David Winsemius dwinsemius at comcast.net
Mon Jan 30 19:15:48 CET 2012


On Jan 30, 2012, at 8:44 AM, Paul Miller wrote:

> Hi Rui, Marc, and Gabor,
>
> Thanks for your replies to my question. All were helpful and it was  
> interesting to see how different people approach various aspects of  
> the same problem.
>
> Spent some time this weekend looking at Rui's solution, which is  
> certainly much clearer than my own. Managed to figure out pretty  
> much all the details of how it works. Also managed to tweak it  
> slightly in order to make it do exactly what I wanted. (See revised  
> code below.)
>
> Still have a couple of questions though. The first concerns the  
> insertion of the code "Y > 2012" to set year values beyond 2012 to  
> NA (on line 10 of the function below).  When I add this (or use it  
> in place of "nchar(Y) > 4"), the code successfully finds the problem
> date "05/16/2015". After that though, it produces the following  
> error message:
>
> Error in if (any(is.na(x) & M != "un" & Y != "un")) cat("Warning:  
> Invalid date values in",  :  missing value where TRUE/FALSE needed

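It's a bit dangerous to use comparison operators on mixed data types.
In your case you are comparing a character value to a numeric value
and may not realize that 2015 is not the same as "2015". R coerces the
number to character and then compares the two as strings. Try "123" >
1000 if you want a quick counter-example (it is TRUE). The same rule
makes "un" > 2012 TRUE, so the "un" year in your third date column is
set to NA. You may want to coerce the Y value to "numeric" mode to be
safe.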
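
A minimal sketch of that coercion rule, using made-up values rather than the test data; the names Y_num and out_of_range are hypothetical. The string comparisons follow collation order, which is assumed here to put digits before letters, as the usual locales do.

```r
# Character vs numeric: the number is coerced to character first.
"123" > 1000     # TRUE, compared as the strings "123" and "1000"
"un" > 2012      # TRUE where letters sort after digits

# Coercing to numeric first gives numeric comparisons; "un" becomes NA.
Y <- c("21931", "1840", "2015", "un")
Y_num <- suppressWarnings(as.numeric(Y))

# Flag only the years that are really outside 1900-2012.
out_of_range <- !is.na(Y_num) & (Y_num < 1900 | Y_num > 2012)
out_of_range     # TRUE TRUE TRUE FALSE
```

Keeping the "un" entry out of the range test leaves it available for the later check against "un".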

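Also, 'any' returns NA when none of its values is TRUE and at least
one is NA, and 'if' stops with exactly that error when its condition
is NA. One way to avoid the error is:

any(is.na(x) & M != "un" & Y != "un", na.rm = TRUE)

Entries whose Y has already been set to NA are then dropped from the
test, so comparing against the original year strings may be better.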
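
To see how 'any' and 'if' treat missing values, here is a small sketch with made-up logical vectors (not the data from the function; a, b and res are hypothetical names). The last two lines show that arguments separated by commas are pooled, which amounts to OR and is a different test from one joined with &.

```r
# any() gives NA when nothing is TRUE but something is NA.
any(c(FALSE, NA))                  # NA
any(c(FALSE, NA), na.rm = TRUE)    # FALSE
any(c(TRUE, NA))                   # TRUE

# if() cannot branch on NA; capture the resulting error message.
res <- tryCatch(if (any(c(FALSE, NA))) "warn" else "ok",
                error = function(e) conditionMessage(e))
res

# Comma-separated arguments are pooled (OR), unlike a & b (AND).
a <- c(TRUE, FALSE)
b <- c(FALSE, FALSE)
any(a & b)    # FALSE
any(a, b)     # TRUE
```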

>
> Why is this happening? If the code correctly handles the
> date "06/20/1840" without producing an error, why can't it do
> likewise with "05/16/2015"?
>
> The second question is why it's necessary to put "x" on line 15  
> following "cat("Warning ...)". I know that I don't get any date  
> columns if I don't include this but am not sure why.
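
On the second question, a function in R returns the value of the last expression it evaluates. If the 'if(...) cat(...)' statement comes last, that value is NULL, so no dates come back. A small sketch with two hypothetical helpers (without_x and with_x are illustrative names, not part of the code below):

```r
# Ends on the if/cat statement, so the function returns NULL.
without_x <- function(x) {
    if (any(is.na(x))) cat("Warning: missing values\n")
}

# Ends on x, so the function returns the vector itself.
with_x <- function(x) {
    if (any(is.na(x))) cat("Warning: missing values\n")
    x
}

is.null(without_x(c(1, NA)))   # TRUE
with_x(c(1, NA))               # 1 NA
```
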
>
> The third question is whether it's possible to change the class of  
> the date variables without using a for loop. I played around with  
> this a little but didn't find a vectorized alternative. It may be  
> that this is not really important. It's just that I've read in  
> several places that for loops should be avoided wherever possible.
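
On the third question, one possible loop-free approach is to build the columns with lapply() instead of sapply(). sapply() simplifies its result to a matrix, which drops the "Date" class, while lapply() returns a list whose elements keep it. A sketch on a made-up data frame (DF, d1 and d2 are hypothetical names):

```r
# Made-up data frame with two character date columns.
DF <- data.frame(d1 = c("2009-05-23", "2010-02-30"),
                 d2 = c("2011-03-17", "2011-04-30"),
                 stringsAsFactors = FALSE)

# Assigning into DF[] replaces the columns but keeps the data frame.
DF[] <- lapply(DF, as.Date, format = "%Y-%m-%d")

sapply(DF, class)    # both columns are "Date"
is.na(DF$d1)         # FALSE TRUE, since 2010-02-30 is not a real date
```

A for loop over a handful of columns costs very little, so this is mostly a matter of style.
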
>
> Thanks,
>
> Paul
>
>
> ##########################################
> #### Code for detecting invalid dates ####
> ##########################################
>
> #### Test Data ####
>
> connection <- textConnection("
> 1 11/23/21931 05/23/2009 un/17/2011
> 2 06/20/1840  02/30/2010 03/17/2011
> 3 06/17/1935  12/20/2008 07/un/2011
> 4 05/31/1937  01/18/2007 04/30/2011
> 5 06/31/1933  05/16/2015 11/20/un
> ")
>
> TestDates <- data.frame(scan(connection,
> 		 list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))
>
> close(connection)
>
> #### Input Data ####
>
> TDSaved <- TestDates
>
> #### List of Date Variables ####
>
> DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")
>
> #### Date Function ####
>
> fun <- function(Dat){
>    f <- function(jj, DF){
>        x <- as.character(DF[, jj])
>        x <- unlist(strsplit(x, "/"))
>        n <- length(x)
>        M <- x[seq(1, n, 3)]
>        D <- x[seq(2, n, 3)]
>        Y <- x[seq(3, n, 3)]
>        D[D == "un"] <- "15"
>        Y <- ifelse(nchar(Y) > 4 | Y > 2012 | Y < 1900, NA, Y)
>        x <- as.Date(paste(Y, M, D, sep="-"), format="%Y-%m-%d")
>        if(any(is.na(x) & M != "un" & Y != "un"))
>            cat("Warning: Invalid date values in", jj, "\n",
>                as.character(DF[is.na(x), jj]), "\n")
>        x
>    }
>    Dat <- data.frame(sapply(names(Dat), function(j) f(j, Dat)))
>    for(i in names(Dat)) class(Dat[[i]]) <- "Date"
>    Dat
> }
>
> #### Output Data ####
>
> TD <- TDSaved
>
> #### Read Dates ####
>
> TD[, DateNames] <- fun(TD[, DateNames])
> TD
>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT


