[R] Why do my regular expressions require a double escape \\ to get a literal??

Roey Angel angel at mpi-marburg.mpg.de
Fri Mar 2 09:36:43 CET 2012


Hi,
I was recently unfortunate enough to have to use regular expressions to 
sort out some data in R.
I'm working on a data file which contains taxonomic data for bacteria 
in hierarchical order.
A sample of this file can be generated using:

con <- textConnection('
G9SS7BA01D15EC    Bacteria(100)    Cyanobacteria(84)    unclassified
G9SS7BA01C9UIR    Bacteria(100)    Proteobacteria(94)    Alphaproteobacteria(89)
G9SS7BA01CM00D    Bacteria(100)    Proteobacteria(99)    Alphaproteobacteria(99)
')
tax.data <- read.table(con, header = FALSE)
close(con)

What I am trying to do is remove the parentheses and the number inside 
them (which could contain a decimal point).
I assumed that the following command would do it, but instead I got 
an error.

tax.data <- as.data.frame(apply(tax.data, 2, function(x) 
gsub('\(.*\)','',x)))
Error: '\(' is an unrecognized escape in character string starting "\("

And it doesn't matter whether I use perl = TRUE or not.
To solve it I need to put a double escape sign '\\' before the opening 
and closing parentheses:

tax.data <- as.data.frame(apply(tax.data, 2, function(x) 
gsub('\\(.*\\)','',x)))

This yields the desired result, but I wonder why the double escape is 
needed.
No other regular expression system I'm used to (e.g. Perl, shell) works 
like that.
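
A small check of what the working pattern looks like once R has read the 
string literal, as a sketch only. It assumes that the backslashes are 
processed when the quoted string is read, before gsub ever sees it. The 
vector x holds made-up values modelled on the data above, and the 
bracket form at the end is just an alternative spelling of the same 
pattern that needs no backslashes.

```r
# The pattern exactly as typed in the working call (8 characters between the quotes)
pat <- '\\(.*\\)'

# Print the string as stored: each doubled backslash shows up as a single one
cat(pat, "\n")

# Number of characters actually stored in the string
n <- nchar(pat)
print(n)

# Made-up sample values, for illustration only
x <- c("Bacteria(100)", "Proteobacteria(94)",
       "Alphaproteobacteria(89.5)", "unclassified")

# Remove the parenthesised number from each value
cleaned <- gsub(pat, "", x)
print(cleaned)

# Alternative: bracket expressions make the parentheses literal without backslashes
cleaned2 <- gsub("[(].*[)]", "", x)
print(identical(cleaned, cleaned2))
```

If that assumption holds, the regex engine receives a single backslash 
before each parenthesis, just as it would in Perl.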

I'm using R 2.14 (and also tried R 2.10), and I get the same results on 
Ubuntu and Windows XP.

I'd appreciate any explanation.

Thanks in advance,
baffled Roey

-- 
Dr. Roey Angel

Max-Planck-Institute for Terrestrial Microbiology
Karl-von-Frisch-Strasse 10
D-35043 Marburg, Germany

Office: +49 (0)6421/178-832
Mobile: +49 (0)176/612-785-88
