[R] Regular expressions, genbank

arun smartpink111 at yahoo.com
Thu Feb 6 19:55:41 CET 2014

One way would be: 

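library(stringr)  # provides str_trim()
vec1 <- c("CDS             3300..4037",
          "CDS             complement(3300..4037)",
          "CDS             3300<..4037",
          "CDS             join(21467..26641,27577..28890)",
          "CDS             complement(join(30708..31700,31931..31984))",
          "CDS             3300<..>4037")
# drop numbers followed by "<" or preceded by ">", turn every run of
# non-digits into a single space, trim, then split on spaces
as.numeric(unlist(strsplit(str_trim(gsub("\\D+"," ",gsub("\\d+<|>\\d+","",vec1)))," ")))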
# [1]  3300  4037  3300  4037  4037 21467 26641 27577 28890 30708 31700 31931
#[13] 31984
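
The one-liner above depends on str_trim() from the stringr package. As a rough sketch, the same steps can be written with base R only (assuming R >= 3.2.0 for trimws()), keeping one result per input line so it stays clear which numbers came from which feature. The helper name extract_coords is made up for this illustration and is not part of any package:

```r
vec1 <- c("CDS             3300..4037",
          "CDS             complement(3300..4037)",
          "CDS             3300<..4037",
          "CDS             join(21467..26641,27577..28890)",
          "CDS             complement(join(30708..31700,31931..31984))",
          "CDS             3300<..>4037")

# Hypothetical helper (not from the original post): same logic, base R only.
extract_coords <- function(x) {
  # 1. drop flagged numbers: digits followed by "<" or preceded by ">"
  x <- gsub("\\d+<|>\\d+", "", x)
  # 2. collapse every run of non-digits into one space, then trim the ends
  x <- trimws(gsub("\\D+", " ", x))
  # 3. split on spaces and convert; a line with no numbers left gives numeric(0)
  lapply(strsplit(x, " ", fixed = TRUE), as.numeric)
}

coords <- extract_coords(vec1)  # list with one numeric vector per line
unlist(coords)                  # flattened, like the one-liner above
```

Returning a list instead of one flat vector makes it possible to tell that the last element of vec1 contributed no numbers at all.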


I have been using R for the past 1.5 years and have usually 
found topics relatively easy to learn on my own, but I am 
finding the learning curve with regular expressions a little 
steep, especially since I haven't found any good tutorials. While I 
intend to spend more time systematically learning the proper way to write 
regular expressions, I have a project that is coming due and can't wait 
for that, so I was hoping to get some direct help. 
I need to extract all the numbers in lines with the following formats: 

"CDS             3300..4037" 
"CDS             complement(3300..4037)" 
"CDS             join(21467..26641,27577..28890)" 
"CDS             complement(join(30708..31700,31931..31984))" 

but not if any of the numbers are preceded by "<" or followed by ">" 
Many thanks in advance!
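
Read literally, the request is to skip a number that is preceded by "<" or followed by ">". A minimal sketch of that reading, using Perl-style lookarounds in base R, could look like this. The test lines in vec2 are made up for illustration, and the pattern would need adjusting if the markers sit on the other side of the numbers:

```r
# Made-up test lines that follow the wording of the question
vec2 <- c("CDS             3300..4037",
          "CDS             <3300..4037",
          "CDS             3300..4037>",
          "CDS             join(21467..26641,27577..28890)")

# Keep a number only if it is not preceded by "<" and not followed by ">".
# The extra \\d in each lookaround stops the match from starting or ending
# in the middle of a flagged number.
pat <- "(?<![<\\d])\\d+(?![>\\d])"

# One numeric vector per input line
nums <- lapply(regmatches(vec2, gregexpr(pat, vec2, perl = TRUE)), as.numeric)
```

Here nums[[2]] holds only 4037 and nums[[3]] holds only 3300, while the unflagged lines keep all of their numbers.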
