[Rd] [patch] Use JIT for PCRE pattern matching

Mikko Korpela mikko.korpela at aalto.fi
Sat Nov 28 13:50:05 CET 2015


According to ?pcre_config, just-in-time compilation support in the
PCRE library <http://pcre.org/> is "desirable for speed". However, it
seems that the pattern matching functions defined in src/main/grep.c
make no effort to utilize the possible JIT support. Therefore it
appears that currently R does not benefit from JIT support in PCRE.

The attached patch is an attempt to enable JIT in functions using the
PCRE library. It was written by following instructions on the
"pcrejit" man page
<http://pcre.org/original/doc/html/pcrejit.html#SEC4>. The patch also
fixes what I think is a nesting error that seems to have prevented
pcre_study() from ever running in grep() and grepl().
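
The nesting problem can be illustrated with a minimal sketch. It assumes only that R's error() does not return to its caller, which is imitated here with longjmp(). All names (fake_error, fake_study, old_nesting, patched_nesting, count_study_calls) are hypothetical stand-ins and do not appear in grep.c.

```c
#include <setjmp.h>

/* Hypothetical stand-ins: none of these names come from grep.c. */
static jmp_buf toplevel;   /* where fake_error() jumps back to */
static int study_calls;    /* how often the "study" step ran */

/* Like R's error(), this never returns to its caller. */
static void fake_error(void) { longjmp(toplevel, 1); }

/* Stands in for the pcre_study() call. */
static void fake_study(void) { study_calls++; }

/* Old nesting: the study step sits after the error call, inside the
   failure branch. It is skipped on success and unreachable on failure. */
static void old_nesting(int compiled_ok, int n)
{
    if (!compiled_ok) {
        fake_error();
        if (n > 10) fake_study();
    }
}

/* Patched nesting: the study step follows the failure branch. */
static void patched_nesting(int compiled_ok, int n)
{
    if (!compiled_ok) fake_error();
    if (n > 10) fake_study();
}

/* Run one variant and report how many times the study step ran. */
int count_study_calls(int patched, int compiled_ok, int n)
{
    study_calls = 0;
    if (setjmp(toplevel) == 0) {
        if (patched) patched_nesting(compiled_ok, n);
        else old_nesting(compiled_ok, n);
    }
    return study_calls;
}
```

Under these assumptions, count_study_calls(0, 1, 11) gives 0 (the old nesting never studies the pattern), while count_study_calls(1, 1, 11) gives 1.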

I tested the patch with the following code. It defines a vector of 11
strings, which is just enough to trigger the use of pcre_study(), which
now also runs the JIT compiler. It includes a very long string to
highlight another issue: deep recursion and possible segmentation
faults from PCRE (pcre_exec).

# Initialization code
pattern <- "([^[:alpha:]]|a|b)+"
long_string <- paste0(rep("a", 1023), collapse="")
longer_string <- paste0(rep("a", 1024), collapse="")
longest_string <- paste0(rep("a", 1e7), collapse="")
strings <- c(longest_string, longer_string, long_string, rep("a", 8))
# Calls to functions using PCRE
grepl(pattern, strings, perl = TRUE)
# [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
# (output with R-devel r69712 using the attached patch,
#  PCRE 8.38 with JIT enabled, Ubuntu 14.04 64-bit on x86_64)
grep(pattern, strings, perl = TRUE)
# [1]  3  4  5  6  7  8  9 10 11
# (the same comments as above)
foo <- strsplit(strings, pattern, perl = TRUE)
foo <- sub(pattern, "", strings, perl = TRUE)
foo <- gsub(pattern, "", strings, perl = TRUE)
foo <- regexpr(pattern, strings, perl = TRUE)
foo <- gregexpr(pattern, strings, perl = TRUE)

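With the patch applied, the code above runs without crashing. However,
the example pattern runs into capacity limits of PCRE when the input is
long: the two longest strings are reported as non-matching (FALSE in
grepl(), absent from the grep() result) although the pattern should
match them.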

Without the patch, every command above that uses PCRE (perl = TRUE)
results in a segmentation fault. In this case, R must be restarted
between segmentation faults even though the R prompt reappears after
the segfault. On Windows (Windows 10, R 3.2.2), however, the first
segmentation fault seems to kill R. In my tests on Linux and Mac OS X,
fewer than 10000 characters of input are enough for a segfault, but
about 62000 were needed on Windows.

Even with the patch, segmentation faults will occur if the character
vector used as input has fewer than 11 elements, because pcre_study()
(and with it the JIT compiler) is only called when n > 10.

The cause of the segmentation fault, namely deep recursion and running
out of stack memory, is explained on the "pcrestack" man page
<http://pcre.org/original/doc/html/pcrestack.html>. Apparently this is
considered to be a feature, not a bug. It would be nice if R could
provide some protection against this, but it seems to be a non-trivial
task. Compiling the PCRE library with --disable-stack-for-recursion is
one solution.
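
The kind of protection meant here can be sketched with a toy matcher. This is not PCRE code: it is a hypothetical recursive matcher for a pattern like "(a|b)+" that uses one recursive call per matched character, so its call depth grows with the input length in the way the "pcrestack" page describes. The names toy_repeat, toy_match and the TOY_* codes are made up for illustration. The sketch adds a depth cap and returns an error code instead of overflowing the stack. As far as I know, PCRE offers a comparable cap through the match_limit_recursion field of pcre_extra; how that interacts with JIT is not covered by this sketch.

```c
#include <stddef.h>

/* Hypothetical result codes for the toy matcher. */
enum { TOY_RECURSION_LIMIT = -1, TOY_NOMATCH = 0, TOY_MATCH = 1 };

/* Match a prefix of one or more 'a' or 'b' characters.
   Each consumed character costs one recursive call, so the call
   depth equals the number of characters matched so far. */
static int toy_repeat(const char *s, size_t depth, size_t limit)
{
    /* Give up with an error code instead of recursing without bound. */
    if (depth > limit) return TOY_RECURSION_LIMIT;

    if (*s == 'a' || *s == 'b') {
        /* Greedy: try to consume one more character first. */
        int rc = toy_repeat(s + 1, depth + 1, limit);
        if (rc != TOY_NOMATCH) return rc;  /* match or limit error */
        return TOY_MATCH;                  /* backtrack: stop here */
    }
    /* No further character fits: succeed if at least one was consumed. */
    return depth > 0 ? TOY_MATCH : TOY_NOMATCH;
}

/* Entry point: limit is the maximum recursion depth allowed. */
int toy_match(const char *subject, size_t limit)
{
    return toy_repeat(subject, 0, limit);
}
```

With a cap in place, a too-long subject produces TOY_RECURSION_LIMIT, which a caller could turn into a warning or an error rather than a crash.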

-- 
Mikko Korpela
Aalto University School of Science
Department of Computer Science
-------------- next part --------------
Index: src/main/grep.c
===================================================================
--- src/main/grep.c	(revision 69712)
+++ src/main/grep.c	(working copy)
@@ -52,6 +52,10 @@
 # include <config.h>
 #endif
 
+/* Compatibility with PCRE < 8.20 */
+#ifndef PCRE_STUDY_JIT_COMPILE
+# define PCRE_STUDY_JIT_COMPILE 0
+#endif
 
 /* interval at which to check interrupts */
 #define NINTERRUPT 1000000
@@ -405,7 +409,7 @@
 			    errorptr, split+erroffset);
 		error(_("invalid split pattern '%s'"), split);
 	    }
-	    re_pe = pcre_study(re_pcre, 0, &errorptr);
+	    re_pe = pcre_study(re_pcre, PCRE_STUDY_JIT_COMPILE, &errorptr);
 	    if (errorptr)
 		warning(_("PCRE pattern study error\n\t'%s'\n"), errorptr);
 
@@ -482,7 +486,11 @@
 		}
 		vmaxset(vmax2);
 	    }
+#ifdef PCRE_CONFIG_JIT
+	    pcre_free_study(re_pe);
+#else
 	    pcre_free(re_pe);
+#endif
 	    pcre_free(re_pcre);
 	} else if (!useBytes && use_UTF8) { /* ERE in wchar_t */
 	    regex_t reg;
@@ -867,12 +875,12 @@
 		warning(_("PCRE pattern compilation error\n\t'%s'\n\tat '%s'\n"),
 			errorptr, spat+erroffset);
 	    error(_("invalid regular expression '%s'"), spat);
-	    if (n > 10) {
-		re_pe = pcre_study(re_pcre, 0, &errorptr);
-		if (errorptr)
-		    warning(_("PCRE pattern study error\n\t'%s'\n"), errorptr);
-	    }
 	}
+	if (n > 10) {
+	    re_pe = pcre_study(re_pcre, PCRE_STUDY_JIT_COMPILE, &errorptr);
+	    if (errorptr)
+		warning(_("PCRE pattern study error\n\t'%s'\n"), errorptr);
+	}
     } else {
 	int cflags = REG_NOSUB | REG_EXTENDED;
 	if (igcase_opt) cflags |= REG_ICASE;
@@ -929,7 +937,11 @@
 
     if (fixed_opt);
     else if (perl_opt) {
+#ifdef PCRE_CONFIG_JIT
+	if (re_pe) pcre_free_study(re_pe);
+#else
 	if (re_pe) pcre_free(re_pe);
+#endif
 	pcre_free(re_pcre);
 	pcre_free((void *)tables);
     } else
@@ -1623,7 +1635,7 @@
 	    error(_("invalid regular expression '%s'"), spat);
 	}
 	if (n > 10) {
-	    re_pe = pcre_study(re_pcre, 0, &errorptr);
+	    re_pe = pcre_study(re_pcre, PCRE_STUDY_JIT_COMPILE, &errorptr);
 	    if (errorptr)
 		warning(_("PCRE pattern study error\n\t'%s'\n"), errorptr);
 	}
@@ -1910,7 +1922,11 @@
 
     if (fixed_opt) ;
     else if (perl_opt) {
+#ifdef PCRE_CONFIG_JIT
+	if (re_pe) pcre_free_study(re_pe);
+#else
 	if (re_pe) pcre_free(re_pe);
+#endif
 	pcre_free(re_pcre);
 	pcre_free((void *)tables);
     } else tre_regfree(&reg);
@@ -2421,7 +2437,7 @@
 	    error(_("invalid regular expression '%s'"), spat);
 	}
 	if (n > 10) {
-	    re_pe = pcre_study(re_pcre, 0, &errorptr);
+	    re_pe = pcre_study(re_pcre, PCRE_STUDY_JIT_COMPILE, &errorptr);
 	    if (errorptr)
 		warning(_("PCRE pattern study error\n\t'%s'\n"), errorptr);
 	}
@@ -2599,7 +2615,11 @@
 
     if (fixed_opt) ;
     else if (perl_opt) {
+#ifdef PCRE_CONFIG_JIT
+	if (re_pe) pcre_free_study(re_pe);
+#else
 	if (re_pe) pcre_free(re_pe);
+#endif
 	pcre_free(re_pcre);
 	pcre_free((void *)tables);
 	UNPROTECT(1);

