[R] pattern search

Gabor Grothendieck ggrothendieck at myway.com
Sat Oct 30 02:29:27 CEST 2004


Sean Liang <SLiang <at> wyeth.com> writes:
> I have a vector of sequences each of which might contain any number of a
> given pattern (e.g.
> 
> >pat=c("ATCGTTTGCTAC", "GGCTAATGCATTGC");
> > grep ("TGC", pat)
> [1] 1 2
> 
> grep only tells me the position of first occurrence in each element
> whereas the second element contains two "TGC"s. 
[...]
> I like to know the number of
> occurrences and the positions if possible. 

The following creates v, a list the same length as pat, whose
elements are vectors holding the pieces of each pat element
split along the boundaries of TGC.  This assumes that ":" does
not occur in pat.  lapply then calculates the starting position
of each piece, keeping only those pieces that are TGC.  The
sapply at the end calculates the number of matches for each
element of pat.

pat <- c("ATCGTTTGCTAC", "GGCTAATGCATTGC")

# pat split along TGC boundaries
v <- strsplit(gsub("(TGC)", ":\\1:", pat), split = ":+")

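# starting positions: cumsum(nchar(x)) gives the end position of each
# piece, so subtracting nchar("TGC") and adding 1 gives the start of
# the pieces that are TGC
pos <- lapply(v, function(x) (cumsum(nchar(x)) - nchar("TGC") + 1)[grep("TGC", x)])
pos

# number of matches (pos is used rather than .Last.value so that this
# also works when the code is not run interactively)
sapply(pos, length)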
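
As a cross-check on the positions and counts, here is a minimal
sketch using gregexpr from base R.  It assumes a version of R that
provides gregexpr and that overlapping matches do not need to be
counted.  The names m, starts and counts are introduced only for
this illustration.

```r
pat <- c("ATCGTTTGCTAC", "GGCTAATGCATTGC")

# for each element of pat, gregexpr returns the start of every match
# (or -1 when there is no match); fixed = TRUE treats "TGC" literally
m <- gregexpr("TGC", pat, fixed = TRUE)

# starting positions, dropping the -1 "no match" marker and the
# attributes that gregexpr attaches to its result
starts <- lapply(m, function(x) as.integer(x[x > 0]))

# number of matches
counts <- sapply(starts, length)
```

For the example data, starts should hold 7 for the first element and
7 and 12 for the second, so counts is 1 and 2.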



