[R] how to count the total number of (INCLUDING overlapping) occurrences of a substring within a string?

Hans W Borchers hwborchers at googlemail.com
Sun Dec 20 13:22:59 CET 2009


Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
> 
> Use a zero-width lookahead expression.  It will not consume its match.  See ?regexp
> 
> > gregexpr("a(?=a)", "aaa", perl = TRUE)
> [[1]]
> [1] 1 2
> attr(,"match.length")
> [1] 1 1

I wonder how you would count the number of occurrences of, for example,
'aba' or 'a.a' (*) in the string "ababacababab" using simple lookahead?

In Perl, the modifier '/g' returns all matches of a pattern, and in Python
one could apply the function 'findall'; combined with a lookahead, both
also report overlapping matches.

When I had this task, I wrote a small function findall(), see below, but
I would be glad to see a solution with lookahead only.

Regards
Hans Werner

(*) or anything more complex

----
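    findall <- function(apat, atxt) {
      # Start positions of all matches of apat in atxt, overlapping ones included.
      # The search restarts on a substring, so anchors such as '^' and
      # lookbehind assertions only see the remaining text.
      stopifnot(length(apat) == 1, length(atxt) == 1)
      pos <- integer(0)  # positions of matches (empty if there are none)
      i <- 1; n <- nchar(atxt)
      found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
      while (found > 0) {
        pos <- c(pos, i + found - 1)  # position within the full string
        i <- i + found                # resume one character after the match start
        found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
      }
      return(pos)
    }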
----
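
For comparison with findall() above, a lookahead-only variant might be
sketched as follows. It assumes that gregexpr() with perl = TRUE reports a
zero-length match at every position where the lookahead succeeds. The helper
names find_overlapping() and count_overlapping() are made up for this
illustration; they are not from the thread or from base R.

```r
# Hypothetical helpers (not from the thread): wrap the whole pattern in a
# zero-width lookahead so that a match never consumes any characters.
find_overlapping <- function(apat, atxt) {
  stopifnot(length(apat) == 1, length(atxt) == 1)
  m <- gregexpr(paste0("(?=", apat, ")"), atxt, perl = TRUE)[[1]]
  # gregexpr() signals "no match" with a single -1, so length() alone
  # would report one match; map that case to an empty vector instead.
  if (m[1] == -1) integer(0) else as.integer(m)
}

count_overlapping <- function(apat, atxt) {
  length(find_overlapping(apat, atxt))
}

find_overlapping("aba", "ababacababab")   # start positions of 'aba'
find_overlapping("a.a", "ababacababab")   # start positions of 'a.a'
count_overlapping("aa", "aaa")            # overlapping count
count_overlapping("xyz", "hocus pocus")   # pattern that does not occur
```

In the example string, 'aba' starts at positions 1, 3, 7 and 9, and 'a.a'
additionally at position 5 ('aca'). Unlike findall(), this sketch searches
the full string in a single call, so the pattern is never applied to a
truncated substring.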

> On Sun, Dec 20, 2009 at 1:43 AM, Jonathan <jonsleepy <at> gmail.com> wrote:
> > Last one for you guys:
> >
> > The command:
> >
> > length(gregexpr('cus','hocus pocus')[[1]])
> > [1] 2
> >
> > returns the number of times the substring 'cus' appears in 'hocus pocus'
> > (which is two)
> >
> > It's returning the number of **disjoint** matches.  So:
> >
> > length(gregexpr('aa','aaa')[[1]])
> >  [1] 1
> >
> > returns 1.
> >
> > **What I want to do:**
> > I'm looking for a way to count all occurrences of the substring, including
> > overlapping sets (so 'aa' would be found in 'aaa' two times, because the
> > middle 'a' gets counted twice).
> >
> > Any ideas would be much appreciated!!
> >
> > Signing off and thanks for all the great assistance,
> > Jonathan
> 
>



