[R] sub/grep question: extract year

Enrico Schumann
Thu Aug 9 12:14:52 CEST 2018


Quoting Marc Girondot via R-help <r-help@r-project.org>:

> Hi everybody,
>
> I have some questions about the way sub() works. I hope that
> someone has the answer:
>
> 1/ Why does the second example not return an empty string? There is
> no match.
>
> subtext <- "-1980-"
> sub(".*(1980).*", "\\1", subtext) # return 1980
> sub(".*(1981).*", "\\1", subtext) # return -1980-

This is as documented in ?sub:
    "Elements of character vectors x which are not
     substituted will be returned unchanged"
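
If you do want an empty string when nothing matches, one option is to
test with grepl() first. A minimal sketch; extract_or_empty() is a
hypothetical helper of my own naming, not part of base R:

```r
## Hypothetical helper (not in base R): return the first capture group,
## or "" for elements in which the pattern does not match.
extract_or_empty <- function(pattern, x) {
    ifelse(grepl(pattern, x), sub(pattern, "\\1", x), "")
}

subtext <- "-1980-"
extract_or_empty(".*(1980).*", subtext)
## [1] "1980"
extract_or_empty(".*(1981).*", subtext)
## [1] ""
```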

> 2/ According to the sub documentation, it replaces the first occurrence
> of a pattern: why does it not return 1980?
>
> subtext <- " 1980 1981 "
> sub(".*(198[01]).*", "\\1", subtext) # return 1981

Because the pattern matches the whole string,
not just the year:

     regexpr(".*(198[01]).*", subtext)
     ## [1] 1
     ## attr(,"match.length")
     ## [1] 11
     ## attr(,"useBytes")
     ## [1] TRUE

Within this match, the leading ".*" is greedy: it consumes as much as it
can while still letting the rest of the pattern match. So the group
captures the last year, "1981". If you want to _extract_ the first year,
use a non-greedy RE instead:

     sub(".*?(198[01]).*", "\\1", subtext)
     ## [1] "1980"
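
To see what each variant consumes, you can wrap every part of the
pattern in its own group and inspect the pieces with regexec(). This is
only an illustrative sketch: the extra parentheses and perl = TRUE are
my additions (the latter so that the greedy/non-greedy behaviour is
PCRE's), and the names greedy and lazy are arbitrary.

```r
subtext <- " 1980 1981 "

## regexec() gives the full match followed by each parenthesised group.
greedy <- regmatches(subtext,
                     regexec("(.*)(198[01])(.*)", subtext, perl = TRUE))[[1]]
lazy   <- regmatches(subtext,
                     regexec("(.*?)(198[01])(.*)", subtext, perl = TRUE))[[1]]

## Element 2 is what the leading ".*" (or ".*?") consumed,
## element 3 is the captured year.
greedy[2:3]
## [1] " 1980 " "1981"
lazy[2:3]
## [1] " "    "1980"
```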

I say _extract_ because _replacing_ the first occurrence of the pattern
works as expected:

     sub("198[01]", "YYYY", subtext)
     ## [1] " YYYY 1981 "

That is because this pattern matches only the first year, not the whole
string. Perhaps this example makes it clearer:

     test <- "1 2 3 4 5"
     sub("([0-9])", "\\1\\1", test)
     ## [1] "11 2 3 4 5"
     sub(".*([0-9]).*", "\\1\\1", test)
     ## [1] "55"
     sub(".*?([0-9]).*", "\\1\\1", test)
     ## [1] "11"
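
If extraction is all you need, an alternative to sub() is to pull the
match out directly with regmatches(), so the pattern need not cover the
whole string. A small sketch using the same example string:

```r
subtext <- " 1980 1981 "

## First match only
regmatches(subtext, regexpr("198[01]", subtext))
## [1] "1980"

## All matches
regmatches(subtext, gregexpr("198[01]", subtext))[[1]]
## [1] "1980" "1981"
```

When there is no match, regmatches() with regexpr() returns character(0)
rather than the unchanged input.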



> 3/ I want to extract the year from text; I use:
>
> subtext <- "bla 1980 bla"
> sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 1980
> subtext <- "bla 2010 bla"
> sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 2010
>
> but
>
> subtext <- "bla 1010 bla"
> sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 1010
>
> I would like to exclude the case 1010 and others like it.
>
> The solution would be:
>
> 18[0-9][0-9] or 19[0-9][0-9] or 200[0-9] or 201[0-9]
>
> Is there a way to write such a pattern in grep?

You answered this yourself, I think.
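
Your "or" can be written as "|" inside the parentheses. A sketch that
keeps your delimiter classes as they are; the variable names year,
pattern and x are mine:

```r
## The four alternatives from the question, joined with "|"
year <- "(18[0-9][0-9]|19[0-9][0-9]|200[0-9]|201[0-9])"
pattern <- paste0(".*[ \\.\\(-]", year, "[ \\.\\)-].*")

x <- c("bla 1980 bla", "bla 2010 bla", "bla 1010 bla")
grepl(pattern, x)
## [1]  TRUE  TRUE FALSE

## Extract only where the pattern matches, since sub() returns
## non-matching elements unchanged
sub(pattern, "\\1", x[grepl(pattern, x)])
## [1] "1980" "2010"
```

The same set of years can be written more compactly as
"(1[89][0-9][0-9]|20[01][0-9])".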


> Thanks a lot
>
> Marc
>


-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net



