[R] how to count the total number of (INCLUDING overlapping) occurrences of a substring within a string?

Gabor Grothendieck ggrothendieck at gmail.com
Sun Dec 20 13:49:02 CET 2009


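Try this. Keep the first character of the pattern outside the lookahead
and put the rest inside it; the findall() results are shown for comparison: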

> findall("aba", "ababacababab")
[1] 1 3 7 9
> gregexpr("a(?=ba)", "ababacababab", perl = TRUE)
[[1]]
[1] 1 3 7 9
attr(,"match.length")
[1] 1 1 1 1

> findall("a.a", "ababacababab")
[1] 1 3 5 7 9
> gregexpr("a(?=.a)", "ababacababab", perl = TRUE)
[[1]]
[1] 1 3 5 7 9
attr(,"match.length")
[1] 1 1 1 1 1
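
The examples above return positions. To get a count directly, one option is a small wrapper along the following lines. This is only a sketch: count_overlapping() is a hypothetical helper name, not a base R function and not something posted in this thread. It assumes the pattern is a valid Perl-style regex and wraps the whole pattern in a zero-width lookahead, so no splitting into "first character" and "rest" is needed. It also handles the no-match case, where gregexpr() returns -1 and length() alone would report 1.

```r
# Hypothetical helper (not from the thread): count all matches of `pattern`
# in `text`, including overlapping ones, by testing the pattern as a
# zero-width lookahead at every position.
count_overlapping <- function(pattern, text) {
  stopifnot(length(pattern) == 1, length(text) == 1)
  # "(?=...)" matches wherever `pattern` could start, without consuming text
  m <- gregexpr(paste0("(?=", pattern, ")"), text, perl = TRUE)[[1]]
  # gregexpr() signals "no match" with a single -1, so length(m) would be 1
  if (m[1] == -1) 0L else length(m)
}

count_overlapping("aba", "ababacababab")  # 4
count_overlapping("a.a", "ababacababab")  # 5
count_overlapping("aa",  "aaa")           # 2
count_overlapping("xyz", "ababacababab")  # 0
```

Wrapping the whole pattern keeps the helper independent of the pattern's structure, at the cost of every match having length zero, so only the positions (and their count) are meaningful.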


On Sun, Dec 20, 2009 at 7:22 AM, Hans W Borchers
<hwborchers at googlemail.com> wrote:
> Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
>>
>> Use a zero-width lookahead assertion.  It does not consume the text it
>> matches.  See ?regexp
>>
>> > gregexpr("a(?=a)", "aaa", perl = TRUE)
>> [[1]]
>> [1] 1 2
>> attr(,"match.length")
>> [1] 1 1
>
> I wonder how you would count the number of occurrences of, for example,
> 'aba' or 'a.a' (*) in the string "ababacababab" using simple lookahead?
>
> In Perl, there is a modifier '/g' to do that, and in Python one could
> apply the function 'findall'.
>
> When I had this task, I wrote a small function findall(), see below, but
> I would be glad to see a solution with lookahead only.
>
> Regards
> Hans Werner
>
> (*) or anything more complex
>
> ----
>    findall <- function(apat, atxt) {
>      stopifnot(length(apat) == 1, length(atxt) == 1)
>      pos <- c()  # start positions of matches within atxt
>      i <- 1; n <- nchar(atxt)
>      # search the remainder of the string, starting at position i
>      found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
>      while (found > 0) {
>        pos <- c(pos, i + found - 1)  # convert to a position in atxt
>        i <- i + found                # resume one character past the match start
>        found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
>      }
>      return(pos)
>    }
> ----
>
>> On Sun, Dec 20, 2009 at 1:43 AM, Jonathan <jonsleepy <at> gmail.com> wrote:
>> > Last one for you guys:
>> >
>> > The command:
>> >
>> > length(gregexpr('cus','hocus pocus')[[1]])
>> > [1] 2
>> >
>> > returns the number of times the substring 'cus' appears in 'hocus pocus'
>> > (which is two)
>> >
>> > It's returning the number of **disjoint** matches.  So:
>> >
>> > length(gregexpr('aa','aaa')[[1]])
>> >  [1] 1
>> >
>> > returns 1.
>> >
>> > **What I want to do:**
>> > I'm looking for a way to count all occurrences of the substring, including
>> > overlapping sets (so 'aa' would be found in 'aaa' two times, because the
>> > middle 'a' gets counted twice).
>> >
>> > Any ideas would be much appreciated!!
>> >
>> > Signing off and thanks for all the great assistance,
>> > Jonathan
>>
>>
>



