[R] extract date

Prof Brian Ripley ripley at stats.ox.ac.uk
Tue Apr 5 13:38:50 CEST 2005


On Tue, 5 Apr 2005, Petr Pikal wrote:

> Dear Prof.Ripley
>
> Thank you for your answer. After some trial and error I ended up
> with a suitable extraction function which gives me a substantial
> increase in positive answers.
>
> Nevertheless I definitely need to gain more practice in regular
> expressions, but from the help page I can grasp only easy things. Is
> there any "Regular expressions for dummies" available?

Not that I know of.

One of my sysadmins uses an O'Reilly pocket guide for reference, and the 
O'Reilly `Mastering Regular Expressions' book is to my mind no better
than the POSIX standards.

A quick look on Amazon suggests

Sams Teach Yourself Regular Expressions in 10 Minutes SAMS 0672325667

to be highly rated.

>
> Best regards
> Petr Pikal
>
>
> On 5 Apr 2005 at 10:23, Prof Brian Ripley wrote:
>
>> On Tue, 5 Apr 2005, Petr Pikal wrote:
>>
>>> Dear all,
>>>
>>> please, is there any way to extract a date from data
>>> which look like this:
>>
>> Yes, if you delimit all the possibilities.
>>
>>> ....
>>> "Date: Sat, 21 Feb 04 10:25:43 GMT"
>>> "Date: 13 Feb 2004 13:54:22 -0600"
>>> "Date: Fri, 20 Feb 2004 17:00:48 +0000"
>>> "Date: Fri, 14 Jun 2002 16:22:27 -0400"
>>> "Date: Wed, 18 Feb 2004 08:53:56 -0500"
>>> "Date: 20 Feb 2004 02:18:58 -0600"
>>> "Date: Sun, 15 Feb 2004 16:01:19 +0800"
>>> ....
>>>
>>> I used
>>>
>>> strptime(paste(substr(x,12,13), substr(x,15,17), substr(x,19,22),
>>> sep="-"), format="%d-%b-%Y")
>>>
>>> which suits lines 3:5 and 7 (these are the most common in my
>>> dataset) but obviously does not work with the other lines.
>>
>> For those examples, in character vector 'dates' (without quotes):
>>
>> > nd <- gsub("^[^0-9]*([0-9]+) ([A-Za-z]+) ([0-9]+).*",
>>              "\\1 \\2 \\3", dates)
>> > strptime(nd, "%d %b %y")
>> [1] "2004-02-21" "2020-02-13" "2020-02-20" "2020-06-14" "2020-02-18"
>> [6] "2020-02-20" "2020-02-15"
>>
>> You should be able to amend the regexp for a wider range of forms, but
>> your first line is ambiguous (2004 or 2021?) so there are limits.
>>
>>> If there is no straightforward solution I can live with what I use now
>>> but some automagical function like
>>>
>>> give.me.date.from.my.string.regardles.of.formating(x)
>>> would be great.
>>
>> It would be impossible: when Americans write 07/04/2004 they do not
>> mean April 7th.
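
As a rough sketch of how the regexp approach quoted above might be amended
for a mix of two- and four-digit years: %y consumes only two digits, so
"2004" read with %y comes out as 2020, as the output in the thread shows.
The names 'two_digit' and 'parsed' are illustrative, an English (C) locale
is assumed for %b, and a two-digit final field is assumed to be a year.

```r
# Assumption: month abbreviations are English, so use the C locale for %b.
invisible(Sys.setlocale("LC_TIME", "C"))

# Example headers taken from the thread.
dates <- c("Date: Sat, 21 Feb 04 10:25:43 GMT",
           "Date: 13 Feb 2004 13:54:22 -0600",
           "Date: Fri, 20 Feb 2004 17:00:48 +0000",
           "Date: Fri, 14 Jun 2002 16:22:27 -0400",
           "Date: Wed, 18 Feb 2004 08:53:56 -0500",
           "Date: 20 Feb 2004 02:18:58 -0600",
           "Date: Sun, 15 Feb 2004 16:01:19 +0800")

# Keep only "day month year", using the same pattern as in the thread.
nd <- gsub("^[^0-9]*([0-9]+) ([A-Za-z]+) ([0-9]+).*", "\\1 \\2 \\3", dates)

# Assumption: a final field of exactly two digits is a two-digit year.
two_digit <- grepl(" [0-9]{2}$", nd)

# Parse each group with the matching year format (%y or %Y).
parsed <- rep(as.Date(NA), length(nd))
parsed[two_digit]  <- as.Date(nd[two_digit],  format = "%d %b %y")
parsed[!two_digit] <- as.Date(nd[!two_digit], format = "%d %b %Y")

print(parsed)
```

Under those assumptions the first line is read as 21 Feb 2004; the
ambiguity noted in the thread (2004 or 2021?) is settled by the assumption,
not by the data.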

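To illustrate why no format-free function can exist, here is a minimal
example of the 07/04/2004 case: both readings parse without error, so only
outside knowledge of the writer's convention can choose between them. The
variable names 'us' and 'uk' are illustrative only.

```r
s <- "07/04/2004"

# Month first (US convention): 4 July 2004.
us <- as.Date(s, format = "%m/%d/%Y")

# Day first (UK and much of Europe): 7 April 2004.
uk <- as.Date(s, format = "%d/%m/%Y")

format(us)  # "2004-07-04"
format(uk)  # "2004-04-07"
```
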
-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



