[Rd] Regular expressions & large strings (PR#6617)

Prof Brian Ripley ripley at stats.ox.ac.uk
Sat Feb 28 16:16:15 MET 2004


On Sat, 28 Feb 2004, Mark White wrote:

> Prof Brian Ripley writes:
> > I was able to confirm the error on RH8.0 Linux and the segfault on 
> > Windows.
> > 
> > Note that PCRE is not being used, and if you add perl=TRUE to your [g]sub 
> > calls you get correct results extremely fast.
> 
> Thanks for clarifying that; I hadn't realised.
> 
> > The segfault is occurring in regexec, that is in the GNU regex code 
> > included in R.  I am not clear it is worth spending any time on trying to 
> > find the problem in that code as
> > 
> > - you can use perl=TRUE as an alternative
> > - we will be replacing the GNU regex code in due course to cope with 
> > internationalization issues.
> 
> Sounds fine.  Do you think either of the following are worth
> doing in the meantime?
> 
>   - Add an strsplit() variant with PCRE (perhaps this
>     problem is related to PR#6601; and the speed might be
>     nice anyway).

Worth considering, at least.

>   - Add options(pcre) so the potentially bad code can be
>     avoided without explicitly setting perl=TRUE every time.

No, as unfortunately the definitions are slightly different and there are
a lot of usages of the POSIX regexps in the base R code (and elsewhere).
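To illustrate the kind of difference meant here, a small sketch (the
pattern and input are made up for illustration, not taken from the
report): a lookahead is Perl syntax, so it is understood with perl=TRUE,
whereas the default engine may reject it or treat it differently.

```r
# Illustrative only: "a(?=b)" means "an 'a' that is followed by 'b'"
# in Perl-compatible syntax.
pat <- "a(?=b)"
txt <- "abac"

# With PCRE only the first 'a' (the one followed by 'b') is replaced.
res_perl <- gsub(pat, "X", txt, perl = TRUE)
print(res_perl)  # "Xbac"

# The default engine is not guaranteed to accept the same pattern, so
# guard the call rather than assume an outcome.
res_default <- tryCatch(gsub(pat, "X", txt),
                        error = function(e) NA_character_)
print(res_default)
```

A global switch would silently change the meaning of patterns like this
one wherever they occur.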

I would expect that usages with more than 10000 chars in one string were
rare, and indeed were not supported for most of R's life.  This is yet
another one of those issues where the very limited development resources
come into play.
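
For anyone who does have strings of that size, a minimal sketch of the
perl=TRUE workaround follows. The test string is invented for
illustration and is not the one from the original report.

```r
# Build one string well past the 10000-character mark.
x <- paste(rep("ab", 10000), collapse = "")

# Route the substitution through PCRE instead of the default engine.
y <- gsub("b", "", x, perl = TRUE)

print(nchar(x))  # 20000
print(nchar(y))  # 10000
```

The same argument is accepted by sub() and gsub() alike, so it has to be
added call by call.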

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


