[R] Checking for invalid dates: Code works but needs improvement

Marc Schwartz marc_schwartz at me.com
Thu Jan 26 18:59:09 CET 2012


Paul,

I have a partial solution for you. It is partial in that I have not quite figured out the correct incantation to convert a 5-digit year (e.g., 11/23/21931) properly using the R date functions. According to various sources (e.g., man strptime and man strftime) as well as the R help for both functions, there are extended formats available, but I am having a bout of cerebral flatulence in getting them to work correctly and a search has not been fruitful. Perhaps someone else can offer some insights.

That being said, the one situation not handled correctly is that date, which arguably IS a valid date a long time in the future and which would otherwise be parsed with a truncated year (first four digits only):

> as.Date("11/23/21931", format = "%m/%d/%Y")
[1] "2193-11-23"
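
As a minimal sketch of why the length check in checkDate() below matters: on input, %Y accepts at most four digits, and as.Date() ignores trailing characters, so the fifth digit is silently dropped instead of producing an NA. One way to flag such values before conversion, assuming the dates are always slash-separated with the year last (the helper name hasLongYear is purely illustrative, not from any package):

```r
# Hypothetical helper (not from the original post): TRUE when the final
# slash-separated field consists of five or more digits.
hasLongYear <- function(x) grepl("/[0-9]{5,}$", as.character(x))

dates <- c("11/23/21931", "06/20/1940", "11/20/un")
hasLongYear(dates)
# [1]  TRUE FALSE FALSE
```

Unlike a blanket nchar() > 10 test, this looks only at the year field, so it would still catch a long year if months or days were written without leading zeros.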

Here is one approach:

# Check the date. If as.Date() fails or the input is > 10 characters return it
checkDate <- function(x) as.character(x[is.na(as.Date(x, format = "%m/%d/%Y")) | 
                                        nchar(as.character(x)) > 10])

> lapply(TestDates[, -1], checkDate)
$birthDT
[1] "11/23/21931" "06/31/1933" 

$diagnosisDT
[1] "02/30/2010"

$metastaticDT
[1] "un/17/2011" "07/un/2011" "11/20/un"  
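
To get output closer to the labelled form Paul asked for in point 3, the list that lapply() returns can be walked by name. A rough sketch, where reportInvalid is a hypothetical helper, the bad list is typed in by hand to stand in for the lapply() result above, and otherDT is a made-up column added only to show the empty case:

```r
# Hypothetical helper: print each non-empty element of a named list
# under an "Error: ..." header, followed by a blank line.
reportInvalid <- function(bad) {
  for (nm in names(bad)) {
    if (length(bad[[nm]]) == 0) next  # nothing invalid in this column
    writeLines(c(paste("Error: Invalid date values in", nm), bad[[nm]], ""))
  }
  invisible(bad)
}

# Hand-typed stand-in for lapply(TestDates[, -1], checkDate)
bad <- list(birthDT     = c("11/23/21931", "06/31/1933"),
            diagnosisDT = "02/30/2010",
            otherDT     = character(0))
reportInvalid(bad)
```

writeLines() is used here instead of cat() only because it puts each value on its own line without extra separators; either would do.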


You could fine tune the checkDate() function to handle other formats, etc.
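
As one possible direction for that fine tuning, here is a sketch that takes the format and the maximum input width as arguments. The name checkDate2 and the arguments fmt and width are illustrative assumptions, and the sketch also skips NA inputs, which the version above does not do:

```r
# Sketch only: 'checkDate2', 'fmt' and 'width' are illustrative names.
# Returns the inputs that fail to parse with 'fmt' or are longer than 'width'.
checkDate2 <- function(x, fmt = "%m/%d/%Y", width = 10) {
  x <- as.character(x)
  bad <- is.na(as.Date(x, format = fmt)) | nchar(x) > width
  x[bad & !is.na(x)]  # drop genuinely missing inputs from the report
}

checkDate2(c("2010-02-30", "2011-03-17"), fmt = "%Y-%m-%d")
# [1] "2010-02-30"
```

With the defaults it should flag the same kinds of values as checkDate() for non-missing mm/dd/yyyy input.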

HTH,

Marc Schwartz


On Jan 26, 2012, at 9:54 AM, Paul Miller wrote:

> Sorry, sent this earlier but forgot to add an informative subject line. Am resending, in the hopes of getting further replies. My apologies. Hope this is OK.
> 
> Paul
> 
> 
> Hi Rui,
> 
> Thanks for your reply to my post. My code still has various shortcomings but at least now it is fully functional.
> 
> It may be that, as I transition to using R, I'll have to live with some less than ideal code, at least at the outset. I'll just have to write and re-write my code as I improve.
> 
> Appreciate your help.
> 
> Paul
> 
> 
> Message: 66
> Date: Tue, 24 Jan 2012 09:54:57 -0800 (PST)
> From: Rui Barradas <ruipbarradas at sapo.pt>
> To: r-help at r-project.org
> Subject: Re: [R] Checking for invalid dates: Code works but needs
>    improvement
> Message-ID: <1327427697928-4324533.post at n4.nabble.com>
> Content-Type: text/plain; charset=us-ascii
> 
> Hello,
> 
> Point 3 is very simple: instead of 'print', use 'cat'.
> Unlike 'print', it allows for several arguments and (very) simple formatting.
> 
>  { cat("Error: Invalid date values in", DateNames[[i]], "\n",
>               TestDates[DateNames][[i]][TestDates$Invalid==1], "\n") }
> 
> Rui Barradas
> 
> Message: 53
> Date: Tue, 24 Jan 2012 08:54:49 -0800 (PST)
> From: Paul Miller <pjmiller_57 at yahoo.com>
> To: r-help at r-project.org
> Subject: [R] Checking for invalid dates: Code works but needs
>    improvement
> Message-ID:
>    <1327424089.1149.YahooMailClassic at web161604.mail.bf1.yahoo.com>
> Content-Type: text/plain; charset=us-ascii
> 
> Hello Everyone,
> 
> Still new to R. Wrote some code that finds and prints invalid dates (see below). This code works but I suspect it's not very good. If someone could show me a better way, I'd greatly appreciate it.
> 
> Here is some information about what I'm trying to accomplish. My sense is that the R date functions are best at identifying invalid dates when fed character data in their default format. So my code converts the input dates to character, breaks them apart using strsplit, and then reformats them. It then identifies which dates are "missing" in the sense that the month or year is unknown and prints out any remaining invalid date values.
> 
> As I see it, the code has at least 4 shortcomings.
> 
> 1. It's too long. My understanding is that skilled programmers can usually or often complete tasks like this in a few lines.
> 
> 2. It's not vectorized. I started out trying to do something that was vectorized but ran into problems with the strsplit function. I looked at the help file and it appears this function will only accept a single character vector.
> 
> 3. It prints out the incorrect dates but doesn't indicate which date variable they belong to. I tried various things with paste but never came up with anything that worked. Ideally, I'd like to get something that looks roughly like:
> 
> Error: Invalid date values in birthDT
> 
> "21931-11-23" 
> "1933-06-31"
> 
> Error: Invalid date values in diagnosisDT
> 
> "2010-02-30"
> 
> 4. There's no way to specify names for input and output data. I imagine it would be fairly easy to specify these in the arguments to a function but am not sure how to incorporate that into a for loop.
> 
> Thanks,
> 
> Paul  
> 
> ##########################################
> #### Code for detecting invalid dates ####
> ##########################################
> 
> #### Test Data ####
> 
> connection <- textConnection("
> 1 11/23/21931 05/23/2009 un/17/2011
> 2 06/20/1940  02/30/2010 03/17/2011
> 3 06/17/1935  12/20/2008 07/un/2011
> 4 05/31/1937  01/18/2007 04/30/2011
> 5 06/31/1933  05/16/2009 11/20/un
> ")
> 
> TestDates <- data.frame(scan(connection, 
>         list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))
> 
> close(connection)
> 
> TestDates
> 
> class(TestDates$birthDT)
> class(TestDates$diagnosisDT)
> class(TestDates$metastaticDT)
> 
> #### List of Date Variables ####
> 
> DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")
> 
> #### Read Dates ####
> 
> for (i in seq(TestDates[DateNames])){
> TestDates[DateNames][[i]] <- as.character(TestDates[DateNames][[i]])
> TestDates$ParsedDT <- strsplit(TestDates[DateNames][[i]],"/")
> TestDates$Month <- sapply(TestDates$ParsedDT,function(x)x[1])
> TestDates$Day <- sapply(TestDates$ParsedDT,function(x)x[2])
> TestDates$Year <- sapply(TestDates$ParsedDT,function(x)x[3])
> TestDates$Day[TestDates$Day=="un"] <- "15"
> TestDates[DateNames][[i]] <- with(TestDates, paste(Year, Month, Day, sep = "-"))
> is.na( TestDates[DateNames][[i]] [TestDates$Month=="un"] ) <- T
> is.na( TestDates[DateNames][[i]] [TestDates$Year=="un"] ) <- T
> TestDates$Date <- as.Date(TestDates[DateNames][[i]], format="%Y-%m-%d")
> TestDates$Invalid <- ifelse(is.na(TestDates$Date) & !is.na(TestDates[DateNames][[i]]), 1, 0)
> if( sum(TestDates$Invalid)==0 ) 
>    { TestDates[DateNames][[i]] <- TestDates$Date } else
>    { print ( TestDates[DateNames][[i]][TestDates$Invalid==1]) }
> TestDates <- subset(TestDates, select = -c(ParsedDT, Month, Day, Year, Date, Invalid))
> }
> 
> TestDates
> 
> class(TestDates$birthDT)
> class(TestDates$diagnosisDT)
> class(TestDates$metastaticDT)
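
On point 2 in Paul's original message quoted above: strsplit() is in fact vectorized. The "single character vector" its help page asks for can hold many strings, and the result is a list with one element per input string. A small sketch with made-up input (all variable names here are illustrative):

```r
# Illustrative data only; 'dates' is not taken from TestDates.
dates <- c("06/20/1940", "07/un/2011", "11/20/un")

# strsplit() takes the whole vector and returns one list element per date.
parts <- strsplit(dates, "/", fixed = TRUE)

# Pull out the first and third fields, then flag unknown months or years.
month <- vapply(parts, `[`, character(1), 1)
year  <- vapply(parts, `[`, character(1), 3)
unknown <- month == "un" | year == "un"
unknown
# [1] FALSE  TRUE  TRUE
```

This assumes every date has exactly three slash-separated fields; a malformed entry with fewer fields would yield NA for the missing piece.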


