[R] Parsing regular expressions differently - feature request

Gabor Grothendieck ggrothendieck at gmail.com
Sat Nov 8 20:50:14 CET 2008

On Sat, Nov 8, 2008 at 2:05 PM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
> On 08/11/2008 11:03 AM, Gabor Grothendieck wrote:
>>
>> On Sat, Nov 8, 2008 at 9:41 AM, Duncan Murdoch <murdoch at stats.uwo.ca>
>> wrote:
>>>
>>> On 08/11/2008 7:20 AM, John Wiedenhoeft wrote:
>>>>
>>>> Hi there,
>>>>
>>>> I rejoiced when I realized that you can use Perl regex from within R.
>>>> However, as the FAQ states "Some functions, particularly those involving
>>>> regular expression matching, themselves use metacharacters, which may
>>>> need
>>>> to be escaped by the backslash mechanism. In those cases you may need a
>>>> quadruple backslash to represent a single literal one. "
>>>>
>>>> I was wondering if that is really necessary for perl=TRUE? wouldn't it
>>>> be
>>>> possible to parse a string differently in a regex context, e.g.
>>>> automatically insert \\ for each \ , such that you can use the perl
>>>> syntax
>>>> directly? For example, if you want to input a newline as a character,
>>>> you
>>>> would use \n anyway. At the moment one says \\n to make it clear to R
>>>> that
>>>> you mean \n to make clear that you mean newline... this is pretty
>>>> annoying.
>>>> How likely is it that you want to pass a real newline character to PCRE
>>>> directly?
>>>
>>> No, that's not possible.  At the level where the parsing takes place R
>>> has
>>> no idea of its eventual use, so it can't tell that some strings are going
>>> to
>>> be interpreted as Perl, and others not.
>>>
>>> As Gabor mentioned, there have been various discussions of adding a new
>>> syntax for strings that are parsed literally, without processing any
>>> escapes, but no consensus on the right syntax to use.
>>>
>>> There are currently some fragile tricks that let you avoid escapes, e.g.
>>> using scan() to read a line:
>>>
>>> > re <- scan(what="", n=1)
>>> 1: [^\\]
>>> > re
>>> [1] "[^\\\\]"
>>>
>>> (I call this fragile because it works in scripts processed at console
>>> level,
>>> but not if you type the same thing into a function.)
>>>
>>> So I agree, it would be nice to have new syntax to allow this.  Last time
>>> this came up, I argued for something like \verb in LaTeX where the
>>> delimiter
>>> could be specified differently in each use.  Duncan TL suggested triple
>>> quotes, as in Python.  I think now that triple quotes would be better
>>> than the particular form I suggested.
>>
>> Ruby's quoting method looks quite flexible:
>>
>> http://en.wikibooks.org/wiki/Ruby_Programming/Alternate_quotes
>
> Thanks, I didn't know about those.  I would have preferred Ruby's option to
> the one I made up when we last had this discussion, but it also suffers from
> the same flaw:  it won't work in Rd files.  There the % sign is a comment
> marker.  Saying that sometimes it's not just makes everything more
> complicated.
>
> So right now I'd have to say that Python-style quotes would be my choice.
>  If you want to put '''""" into your string, you'll be stuck using regular
> quotes and escapes, but I could live with that.
>
> Duncan Murdoch
>

One could use a different character.
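
For anyone following the thread, the two layers of escaping under
discussion (the R parser first, then the regex engine) can be seen in a
small sketch. The variable names are only illustrative, and this is a
sketch of current behaviour, not a proposal for new syntax:

```r
# Layer 1: the R parser. The source text "\\" is a string of ONE character.
one_backslash <- "\\"
n_chars <- nchar(one_backslash)

# Layer 2: the regex engine. To match one literal backslash it must see \\,
# so the R source needs four backslashes.
pattern <- "\\\\"
x <- c("a\\b", "ab")                  # only the first element holds a backslash
hits <- grepl(pattern, x, perl = TRUE)

# A newline can reach PCRE either way: "\n" is a real newline character
# produced by the R parser, "\\n" is backslash + n interpreted by PCRE.
s <- "line1\nline2"
nl_direct  <- grepl("\n",  s, perl = TRUE)
nl_escaped <- grepl("\\n", s, perl = TRUE)

# fixed = TRUE skips the regex layer, leaving only the parser's escaping.
hits_fixed <- grepl("\\", x, fixed = TRUE)

print(list(n_chars, hits, nl_direct, nl_escaped, hits_fixed))
```

With fixed = TRUE only the parser's layer remains, which is why two
backslashes suffice there, but that only helps when no regex features
are needed.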