[R] Regular expressions: bug or misunderstanding?

Duncan Murdoch murdoch at stats.uwo.ca
Sun Jul 6 23:17:04 CEST 2008


I'm trying to write a gsub() call that takes a string and escapes all 
the unescaped quote marks in it.  So the string

\"

would be left unchanged, but

\\"

would be changed to

\\\"

because the double backslash doesn't act as an escape for the quote; the 
first backslash just escapes the second.  Regular expressions involving 
backslashes have the usual problem that everything I write becomes 
completely unreadable, so I'm going to change the problem for this 
post:  I will define E to be the escape character and q to be the 
quote.  The gsub() call would leave

Eq

unchanged, but would change

EEq

to EEEq, etc.

The expression I have come up with after this change is

gsub( "((^|[^E])(EE)*)q", "\\1Eq", x)

i.e. "(start of line, or non-escape, followed by an even number of 
escapes), all of which we call expression 1, followed by a quote, is 
replaced by expression 1 followed by an escape and a quote".
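
For repeated experiments, the pattern described above can be wrapped in a small helper. This is only a sketch: the name escape_quotes is hypothetical (it does not appear in the original post), and it keeps the E/q stand-ins rather than real backslashes and quotes.

```r
# Hypothetical helper (the name is an assumption, not from the post).
# E stands in for the escape character and q for the quote.
escape_quotes <- function(x) {
  # Group 1 = (start of string, or one non-E character) followed by an
  # even-length run of E's; the q after it is rewritten as Eq.
  gsub("((^|[^E])(EE)*)q", "\\1Eq", x)
}

# gsub() is vectorised, so several test strings can be checked at once.
res <- escape_quotes(c("Eq", "EEq", "qaq"))
print(res)
```

For these three inputs the helper should return "Eq", "EEEq" and "EqaEq", matching the results reported in this post.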

This works sometimes, but not always:

 > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "Eq")
[1] "Eq"
 > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "EEq")
[1] "EEEq"
 > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qaq")
[1] "EqaEq"
 > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qq")
[1] "qEq"

Notice that in the final example, the first quote doesn't get escaped.  
Why not?
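
One way to investigate, sketched here rather than taken from the original post, is to list the substrings that the pattern actually matches. gregexpr() and regmatches() are base R functions (regmatches() is only available in more recent versions of R); the variable names are illustrative.

```r
pattern <- "((^|[^E])(EE)*)q"
inputs  <- c("qaq", "qq")

# For each input, extract every non-overlapping match of the pattern.
matched <- regmatches(inputs, gregexpr(pattern, inputs))
names(matched) <- inputs
print(matched)
```

If "qq" is reported as a single match covering both characters, then the first q is being consumed as the [^E] part of expression 1. That would be consistent with the "qEq" result above, since gsub() replaces non-overlapping matches and a character used as context for one match cannot also be matched as a quote.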

Duncan Murdoch


