[R] Unexpected behaviour of the as.Date (was: Error as.Date on Invalid Dates)

Greg Snow Greg.Snow at imail.org
Thu Jan 22 22:24:11 CET 2009


Comments interspersed below

From: Marie Sivertsen [mailto:mariesivert at gmail.com] 
Sent: Thursday, January 22, 2009 1:17 PM
To: Greg Snow
Cc: R-help at stat.math.ethz.ch
Subject: Re: [R] Unexpected behaviour of the as.Date (was: Error as.Date on Invalid Dates)

 [snip]


For your question, the help page for as.Date includes:

 "format: A character string.  The default is '"%Y-%m-%d"'.  For
         details see 'strftime'."


To be strict, neither "1/13/2001" nor "13/1/2001" matches the format, so both should raise an error, I think.  Since the behaviour seems not to apply the default strictly, why should one think "13/1/2001" will not be parsed the only reasonable way?

 
The help page for as.Date refers to the help page for strptime which says that details are system specific. So there may be some systems where you would get an error from '/' not being '-', but apparently on your system they are treated the same.   Personally I see a big difference between interpreting an obvious separator as such and changing the order of values.  The fact that it sometimes gets the one correct does not imply to me that the other should happen automatically.  

Dealing with the separators can be done on an individual basis as each character string is processed.  Guessing the order of the entries could require looking at the entire vector/file/dataset, which I expect would slow things down quite a bit.  (And how long would it be before someone complained that it processed file A correctly, but that file B should have been treated like A and was not, because B only included days less than 13 and so the program did not realize this?)
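A minimal sketch of that distinction, assuming a reasonably recent R. The two input strings are the ones from this thread; the variable names are only illustrative. With an explicit format the field order is stated rather than guessed:

```r
# Both strings denote 13 January 2001; only the field order differs.
us <- as.Date("1/13/2001", format = "%m/%d/%Y")   # month/day/year
eu <- as.Date("13/1/2001", format = "%d/%m/%Y")   # day/month/year
us == eu                                          # the same date either way

# A string that does not fit the stated format gives NA rather than a guess.
wrong <- as.Date("1/13/2001", format = "%Y-%m-%d")
is.na(wrong)
```

What happens when no format is given at all is the part that may differ between systems and R versions, so it is not shown here.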


And

"Character strings are processed as far as
    necessary for the format specified: any trailing characters are
    ignored."

I don't see anything in your examples that runs counter to the above.
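As a small illustration of the quoted sentence (a sketch with a made-up input string, not one of the original examples), parsing stops once the format has been satisfied:

```r
# The format consumes "2001-01-13"; the rest of the string is ignored.
x <- as.Date("2001-01-13 and some trailing text", format = "%Y-%m-%d")
format(x)
```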


Yes they do.  None of them match the format, but some parse correctly, some produce rubbish, and some raise an error.  Maybe you want to improve the help page for as.Date to say something like "The default is a sequence of numerical representations of the year, then the month, then the day, separated by one of '-', '/', ...", which would make it clearer.
But would that be correct?  It may be system dependent (or all systems may now do exactly the same thing).  How about having the help page tell you to find out what happens on your system?  (An easy fix: it already does.)

Remember that computers do exactly what you tell them to do, not what you think that they should do.


Computers do exactly what they were programmed to do, and what they will do depends on what the developer told them to do when they are given certain input.  I expect them to do exactly what I tell them to do, and that is to parse "1/13/2001" the only reasonable way.  It seems that someone told them to do something else...

I was using the general 'you' above, which includes the programmer as well as the user.  Since you (singular) did not specify the format, the computer used the default format that the programmer (part of the collective 'you') specified, which says the order is year, month, day.

Many problems come as a result of users forgetting that they are smarter than the computer.  I see 3 ways to remedy the problem:

1. Make computers that are as smart or smarter than people.
2. Make the programmers anticipate every way that someone may use a particular function and make them implement all of the functionality, even if they don't think it is worth the time/effort because there is an easy workaround for many of the less commonly used features.
3. Don't expect the computer to guess correctly and tell it exactly what you want it to do.

I don't think that number 1 will ever happen, and there are plenty of science fiction stories that suggest problems with even trying.

Option 2 stinks of hubris, and even if it were possible, I personally would not want to wait until they were finished before being able to use the functions/programs.

Which leaves option 3, which I think is the best approach even without arguments against the others.

I think the moral of this story is: program defensively, always specify a date format! 
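One way to act on that advice is sketched below. The helper name parse_date_strict is hypothetical (it is not part of R); it simply wraps as.Date with a mandatory format and stops when something fails to parse, instead of silently returning NA:

```r
# Hypothetical helper: parse with an explicit format and fail loudly.
parse_date_strict <- function(x, format) {
  out <- as.Date(x, format = format)
  bad <- is.na(out) & !is.na(x)   # present in the input but not parsed
  if (any(bad)) {
    stop("could not parse with format '", format, "': ",
         paste(unique(x[bad]), collapse = ", "))
  }
  out
}

ok <- parse_date_strict(c("1/13/2001", "2/28/2001"), "%m/%d/%Y")
format(ok)
```

Because trailing characters are still ignored, as the help page text quoted earlier says, this catches values that cannot fit the format (such as a month of 13) but not every malformed string.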


Mvh.
Marie



-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111



