[R] Regular Expressions

Fri Nov 5 08:29:37 CET 2010

That's perfect! 

Don't know how I missed that.

I want to start playing with some modeling of financial data and the
only format I can download is rather ugly.  So my plan is to use a
series of regular expressions to extract what I want.

Noticed that you are a Prof. in applied stats.  I'm at UCLA working on
an MS in stats.  My department is fairly flexible, so I'm taking several
finance courses as part of my work.  Currently debating if I want to
graduate with an MS in June, or roll everything into a PhD and be
finished in an extra 1-2 years.

Thanks!

-N

On 11/5/10 12:09 AM, Prof Brian Ripley wrote:
> On Thu, 4 Nov 2010, Noah Silverman wrote:
>
>> Hi,
>>
>> I'm trying to figure out how to use capturing parentheses in regular
>> expressions in R.  (Doing this in Perl, Java, etc. is fairly trivial,
>> but I can't seem to find the functionality in R.)
>>
>> For example, given the string:    "10 Nov 13.00 (PFE1020K13)"
>>
>> I want to capture the first two digits and then the month abbreviation.
>>
>> In perl, this would be
>>
>> /^(\d\d)\s(\w\w\w)\s/
>>
>> Then I have the variables $1 and $2 assigned to the capturing
>> parentheses.
>>
>> I've found the grep and sub commands in R, but the docs don't
>> indicate any way to capture things.
>>
>> Any suggestions?
>
> Read the link to ?regexp.  It *does* 'indicate the way to capture
> things'.
>
>      The backreference ‘\N’, where ‘N = 1 ... 9’, matches the substring
>      previously matched by the Nth parenthesized subexpression of the
>      regular expression.  (This is an extension for extended regular
>      expressions: POSIX defines them only for basic ones.)
>
> and there is an example on the help page for grep():
>
>      ## Double all 'a' or 'b's;  "\" must be escaped, i.e., 'doubled'
>      gsub("([ab])", "\\1_\\1_", "abc and ABC")
>
> In your example
>
> x <- "10 Nov 13.00 (PFE1020K13)"
> regex <- "(\\d\\d)\\s(\\w\\w\\w).*"
> sub(regex, "\\1", x, perl = TRUE)
> sub(regex, "\\2", x, perl = TRUE)
>
> A better way to do this would be something like
>
> regex <- "([[:digit:]]{2})\\s([[:alpha:]]{3}).*"
>
> which is also a POSIX extended regexp.
>
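
For the archive, here is a minimal sketch of how the POSIX-class pattern
above might be applied to several strings at once.  The second sample
string and the names `day`, `month` and `parsed` are made up for
illustration, and the leading `^` is added to mirror the original Perl
pattern; none of this is from the reply itself.

```r
# Sample strings in the format from the thread; the second one is
# invented for illustration only.
x <- c("10 Nov 13.00 (PFE1020K13)", "22 Jan 15.00 (PFE1122A15)")

# POSIX extended regexp in the spirit of the reply, anchored at the start
# of the string like the Perl version (the anchor is an assumption).
regex <- "^([[:digit:]]{2})[[:space:]]([[:alpha:]]{3}).*"

# sub() is vectorised over x: each call replaces the whole match with one
# captured group, giving one value per input string.
day   <- sub(regex, "\\1", x)
month <- sub(regex, "\\2", x)

# Collect the pieces for later use.
parsed <- data.frame(day = as.integer(day), month = month,
                     stringsAsFactors = FALSE)
print(parsed)
```

Note that sub() returns the input unchanged when the pattern does not
match, so it may be worth checking with grepl(regex, x) first if the
downloaded data is irregular.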