[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases

Toby Hocking tdhock5 @end|ng |rom gm@||@com
Wed Feb 20 19:55:06 CET 2019


Update: I have observed that stringi::stri_sub is linear time complexity,
and it computes the same thing as base::substring. figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
source:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R

To me this is a clear indication of a bug in substring, but again it would
be nice to have some feedback/confirmation before posting on bugzilla.

Also this suggests a fix -- just need to copy whatever stringi::stri_sub is
doing.




On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <tdhock5 using gmail.com> wrote:

> Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
> yesterday)
>
> I believe that I have found another bug, this time in the substring
> function. The use case that I am concerned with is when there is a single
> (character scalar) text/subject, and many substrings to extract. For example
>
> substring("AAAA", 1:4, 1:4)
>
> or more generally,
>
> N=1000
> substring(paste(rep("A", N), collapse=""), 1:N, 1:N)
>
> The problem I observe is that the time complexity is quadratic in N, as
> shown on this figure
> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
> source:
> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
>
> I expected the time complexity to be linear in N.
>
> The example above may seem contrived/trivial, but it is indeed relevant to
> a number of packages (rex, rematch2, namedCapture) which provide functions
> that use gregexpr and then substring to extract the text in the captured
> sub-patterns. The figure
> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
> shows the issue: these packages have quadratic time complexity, whereas
> other packages (and the gregexpr function with perl=TRUE after applying the
> patch discussed yesterday) have linear time complexity. I believe the
> problem is the substring function. Source for this figure:
> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R
>
> I suspect that a fix can be accomplished by optimizing the implementation
> of substring, for the special case when the text/subject is a single
> element (character scalar). Right now I notice that the substring R code
> uses rep_len so that the text/subject which is passed to the C code is a
> character vector with the same length as the number of substrings to
> extract. Maybe the C code is calling strlen for each of these (identical)
> text/subject elements?
>
> Anyway, it would be useful to have some feedback to make sure this is
> indeed a bug before I post on bugzilla. (btw thanks Martin for signing me
> up for an account)
>
> Toby
>

	[[alternative HTML version deleted]]



More information about the R-devel mailing list