[Rd] patch for gregexpr(perl=TRUE)

Toby Hocking tdhock5 @end|ng |rom gm@||@com
Tue Feb 19 21:37:39 CET 2019


Hi all,

Several people have noticed that gregexpr is very slow for large subject
strings when perl=TRUE is specified.
-
https://stackoverflow.com/questions/31216299/r-faster-gregexpr-for-very-large-strings
-
http://r.789695.n4.nabble.com/strsplit-perl-TRUE-gregexpr-perl-TRUE-very-slow-for-long-strings-td4727902.html
- https://stat.ethz.ch/pipermail/r-help/2008-October/178451.html

I figured out the issue, which is fixed by changing 1 line of code in
src/main/grep.c -- there is a strlen function call which is currently
inside of the while loop over matches, and the patch moves it before the
loop.
https://github.com/tdhock/namedCapture-article/blob/master/linear-time-gregexpr-perl.patch

I made some figures that show the quadratic time complexity before applying
the patch, and the linear time complexity after applying the patch
https://github.com/tdhock/namedCapture-article#19-feb-2019

I would have posted a bug report on bugs.r-project.org but I do not have an
account. So can an R-devel person please either (1) accept this patch, or
(2) give me an account so I can post the patch on the bug tracker?

Finally I would like to mention that Bill Dunlap noticed a similar problem
(time complexity which is quadratic in subject size) for strsplit with
perl=TRUE. My patch does NOT fix that, but I suspect that a similar fix
could be accomplished (because I see that strlen is being called in a while
loop in do_strsplit as well).

Thanks
Toby Dylan Hocking

	[[alternative HTML version deleted]]



More information about the R-devel mailing list