[R] Why do my regular expressions require a double escape \\ to get a literal??
bhh at xs4all.nl
Fri Mar 2 11:00:53 CET 2012
On 02-03-2012, at 09:36, Roey Angel wrote:
> I recently had the misfortune of having to use regular expressions to sort out some data in R.
> I'm working on a data file which contains taxonomical data of bacteria in hierarchical order.
> A sample of this file can be generated using:
> tax.data <- read.table(header=F, con <- textConnection('
> G9SS7BA01D15EC Bacteria(100) Cyanobacteria(84) unclassified
> G9SS7BA01C9UIR Bacteria(100) Proteobacteria(94) Alphaproteobacteria(89)
> G9SS7BA01CM00D Bacteria(100) Proteobacteria(99) Alphaproteobacteria(99)
> '))
> close(con)
> What I am trying to do is remove the parentheses and the number inside them (which may contain a decimal point).
> I assumed that the following command would solve it, but instead I got an error.
> tax.data <- as.data.frame(apply(tax.data, 2, function(x) gsub('\(.*\)','',x)))
> Error: '\(' is an unrecognized escape in character string starting "\("
> And it doesn't matter if I use perl = TRUE or not.
> To solve it I need to use a double escape '\\' before the opening and closing parentheses:
> tax.data <- as.data.frame(apply(tax.data, 2, function(x) gsub('\\(.*\\)','',x)))
> This yields the desired result, but I wonder why the double escape is needed.
> No other regular expression system I'm used to (e.g. Perl, Shell) works like that.
> I'm using R 2.14 (but also R 2.10) and I get the same results on Ubuntu and win XP.
> I'd appreciate any explanation.
See the section "Character vectors" in the R Intro manual ("An Introduction to R").
The regular expression is passed to gsub as a string, and R processes escape sequences in string literals before the regular expression parser ever sees them.
For a single \ to reach the regular expression parser, it must itself be escaped in the string literal: \\
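
A minimal sketch of the two stages, in case it helps. The vector x is made up for illustration (the decimal value 99.5 is hypothetical and not taken from the sample file), and the bracket-class pattern is only one possible alternative, not necessarily the best one for the real data.

```r
# Stage 1: R's string parser turns each \\ in the literal into one backslash.
pattern <- '\\(.*\\)'
n <- nchar(pattern)   # counts the characters stored in the string: \ ( . * \ )
cat(pattern, "\n")    # shows what the regex engine will actually receive

# Stage 2: the regex engine reads \( and \) as literal parentheses.
# Hypothetical sample values modelled on the columns in the question.
x <- c("Bacteria(100)", "Proteobacteria(99.5)", "unclassified")
cleaned <- gsub(pattern, "", x)

# Alternative that needs no backslashes: put the parentheses in bracket
# classes, and match only digits and decimal points between them.
cleaned_alt <- gsub("[(][0-9.]+[)]", "", x)

print(cleaned)
print(cleaned_alt)
```

Printing the pattern with cat() rather than print() is the useful check here, because print() shows the escaped form of the string again, while cat() shows the characters as they are stored.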