[R] Regular expressions: bug or misunderstanding?

Duncan Murdoch murdoch at stats.uwo.ca
Mon Jul 7 01:29:16 CEST 2008


On 06/07/2008 5:37 PM, (Ted Harding) wrote:
> On 06-Jul-08 21:17:04, Duncan Murdoch wrote:
>> I'm trying to write a gsub() call that takes a string and escapes all 
>> the unescaped quote marks in it.  So the string
>>
>> \"
>>
>> would be left unchanged, but
>>
>> \\"
>>
>> would be changed to
>>
>> \\\"
>>
>> because the double backslash doesn't act as an escape for the quote;
>> the first backslash just escapes the second.  I have the usual problems of
>> writing regular expressions involving backslashes which make
>> everything I write completely unreadable, so I'm going to change
>> the problem for this post:  I will define E to be the escape
>> character, and q to be the quote; the gsub() call would leave
>>
>> Eq
>>
>> unchanged, but would change
>>
>> EEq
>>
>> to EEEq, etc.
>>
>> The expression I have come up with after this change is
>>
>> gsub( "((^|[^E])(EE)*)q", "\\1Eq", x)
>>
>> i.e. "(start of line, or non-escape, followed by an even number of 
>> escapes), all of which we call expression 1, followed by a quote,
>> is replaced by expression 1 followed by an escape and a quote".
>>
>> This works sometimes, but not always:
>>
>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "Eq")
>> [1] "Eq"
>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "EEq")
>> [1] "EEEq"
>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qaq")
>> [1] "EqaEq"
>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qq")
>> [1] "qEq"
>>
>> Notice that in the final example, the first quote doesn't get escaped. 
>> Why not????
> 
> I think (without having done the "experimental diagnostics")
> that it's because in "qq" the first q matches (^|[^E]) because
> it matches [^E] (i.e. is a "non-escape"); since it is followed
> by q, it is the second q which gets the escape. Possibly you
> need to include "^q" as an additional alternative match at the
> start of the line.
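
One way to check the explanation above is to list the substring each match consumes. This is only a small sketch: show_matches is a hypothetical helper, not part of the original post, and what it prints may depend on the regex engine R is built with.

```r
# Hypothetical helper (not from the original post): return the substrings
# consumed by each match of `pattern` in the single string `x`.
show_matches <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x))[[1]]
}

pattern <- "((^|[^E])(EE)*)q"

# In "qaq" the two quotes are reached by two separate matches.
print(show_matches(pattern, "qaq"))

# In "qq" compare how many matches are found and how long each one is:
# a character consumed as the [^E] part cannot also be matched as a quote.
print(show_matches(pattern, "qq"))
```

If the explanation is right, the [^E] part of the pattern consumes a character, so two adjacent quotes cannot both be the final q of a match.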

Thanks, that sounds right, but now I can't see how to fix it.  Is there 
syntax to say:  match A only if it follows B, but don't match the B part?
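
For what it's worth, Perl-compatible regular expressions have lookbehind assertions, which seem to be this kind of syntax. Below is a minimal sketch, assuming it is acceptable to call gsub() with perl = TRUE; escape_quotes is a hypothetical name used only for illustration, and E and q are the stand-ins defined earlier in the post.

```r
# Sketch only: escape every q that is preceded by an even number of E's.
# (?<!E)     negative lookbehind: the run of E's must not itself follow an E;
#            it asserts this without consuming the preceding character
# ((?:EE)*)  an even number of escapes, captured as group 1
# q          the quote to be escaped
escape_quotes <- function(x) {
  gsub("(?<!E)((?:EE)*)q", "\\1Eq", x, perl = TRUE)
}

print(escape_quotes(c("Eq", "EEq", "qaq", "qq")))
# [1] "Eq"    "EEEq"  "EqaEq" "EqEq"
```

Because the lookbehind is only an assertion, the character before the run of escapes is not part of the match, so it remains available as the quote of a previous match.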

Duncan Murdoch


