[R] Parsing regular expressions differently - feature request

Duncan Murdoch murdoch at stats.uwo.ca
Sat Nov 8 20:05:49 CET 2008


On 08/11/2008 11:03 AM, Gabor Grothendieck wrote:
> On Sat, Nov 8, 2008 at 9:41 AM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
>> On 08/11/2008 7:20 AM, John Wiedenhoeft wrote:
>>> Hi there,
>>>
>>> I rejoiced when I realized that you can use Perl regex from within R.
>>> However, as the FAQ states "Some functions, particularly those involving
>>> regular expression matching, themselves use metacharacters, which may need
>>> to be escaped by the backslash mechanism. In those cases you may need a
>>> quadruple backslash to represent a single literal one."
>>>
>>> I was wondering if that is really necessary for perl=TRUE? wouldn't it be
>>> possible to parse a string differently in a regex context, e.g.
>>> automatically insert \\ for each \ , such that you can use the perl syntax
>>> directly? For example, if you want to input a newline as a character, you
>>> would use \n anyway. At the moment one says \\n to make it clear to R that
>>> you mean \n to make clear that you mean newline... this is pretty annoying.
>>> How likely is it that you want to pass a real newline character to PCRE
>>> directly?
>> No, that's not possible.  At the level where the parsing takes place R has
>> no idea of its eventual use, so it can't tell that some strings are going to
>> be interpreted as Perl, and others not.
>>
>> As Gabor mentioned, there have been various discussions of adding a new
>> syntax for strings that are parsed literally, without processing any
>> escapes, but no consensus on the right syntax to use.
>>
>> There are currently some fragile tricks that let you avoid escapes, e.g.
>> using scan() to read a line:
>>
>>> re <- scan(what="", n=1)
>> 1: [^\\]
>> Read 1 item
>>> re
>> [1] "[^\\\\]"
>>
>> (I call this fragile because it works in scripts processed at console level,
>> but not if you type the same thing into a function.)
>>
>> So I agree, it would be nice to have new syntax to allow this.  Last time
>> this came up, I argued for something like \verb in LaTeX where the delimiter
>> could be specified differently in each use.  Duncan TL suggested triple
>> quotes, as in Python.  I think now that triple quotes would be better
>> than the particular form I suggested.
> 
> Ruby's quoting method looks quite flexible:
> 
> http://en.wikibooks.org/wiki/Ruby_Programming/Alternate_quotes

Thanks, I didn't know about those.  I would have preferred Ruby's option 
to the one I made up when we last had this discussion, but it 
suffers from the same flaw:  it won't work in Rd files, where the % 
sign is a comment marker.  Saying that % is sometimes not a comment 
marker would just make everything more complicated.

So right now I'd have to say that Python-style quotes would be my 
choice.  If you want to put '''""" into your string, you'll be stuck 
using regular quotes and escapes, but I could live with that.
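
For anyone following along, here is a minimal sketch of the two escaping 
levels under discussion, using only base R.  The variable names (path, 
hit, and so on) are made up for illustration.  It shows that the R parser 
strips one level of backslashes before the regex engine sees the pattern, 
and that a real newline and an escaped \n both work with perl=TRUE.

```r
# Illustration only: all variable names here are hypothetical.

# Level 1: the R parser turns the two typed characters \\ into one backslash.
one_backslash <- "\\"
n_chars <- nchar(one_backslash)

# Level 2: the regex engine needs \\ to mean a literal backslash,
# so the R source has to contain four backslashes.
path <- "C:\\temp"                       # stored as C:\temp
hit <- grepl("\\\\", path, perl = TRUE)

# fixed = TRUE bypasses the regex engine, so only the parser level remains.
hit_fixed <- grepl("\\", path, fixed = TRUE)

# A real newline character and the escape sequence \n both match a newline.
nl_real <- grepl("\n", "a\nb", perl = TRUE)
nl_escaped <- grepl("\\n", "a\nb", perl = TRUE)
```

When no regex features are needed, fixed = TRUE is the simplest way to 
drop one of the two levels with the current syntax.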

Duncan Murdoch


