[Rd] Potential issue with perl-based pattern matching with Unicode characters on Windows R 4.0 and above

Carson Sievert cp@|evert1 @end|ng |rom gm@||@com
Tue Jun 9 00:09:54 CEST 2020


Hi everyone,

I've noticed new behavior in `regexpr(..., perl = TRUE)` on Windows with
R 4.0 and above when the input contains Unicode characters. Here's a minimal
example where I'd expect to see a start value of `5` (as R 3.6.2 and below
gives), but R 4.0.0 (and R 4.0.1) now returns `6`:

```
> regexpr("b", "foo\U0001F937bar", perl = TRUE)
#> [1] 6
#> attr(,"match.length")
#> [1] 1
```
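To help isolate where the offset comes from, here is a rough sketch that compares the start position reported by each matching mode for the same string. It only uses base R (`nchar()` and `regexpr()` with its `fixed`, `perl`, and `useBytes` arguments); the variable names `x` and `starts` are just for illustration, and I have not run this on every platform, so treat the comments about expected values as assumptions rather than confirmed output.

```r
# Test string from the example above: "foo", U+1F937, "bar".
x <- "foo\U0001F937bar"

# Character and byte lengths. Assuming U+1F937 is counted as one character
# and stored as four UTF-8 bytes, these would be 7 and 10.
nchar(x, type = "chars")
nchar(x, type = "bytes")

# Start of the first "b" as reported by each matching mode.
starts <- c(
  tre   = as.integer(regexpr("b", x)),                # default (TRE) engine
  fixed = as.integer(regexpr("b", x, fixed = TRUE)),  # literal matching
  perl  = as.integer(regexpr("b", x, perl = TRUE)),   # PCRE, the case in question
  bytes = as.integer(regexpr("b", x, perl = TRUE, useBytes = TRUE))
)

# With useBytes = TRUE the position is a byte offset: 3 bytes for "foo"
# plus 4 for U+1F937 puts "b" at byte 8 when the string is stored as UTF-8.
print(starts)
```

If only the `perl` entry disagrees with `tre` and `fixed`, that would point at the PCRE code path rather than at how the string itself is stored.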

Perhaps this change in behavior could be explained by R 4.0's migration to
PCRE2? Here is some relevant output from my R 4.0 session:

```
> pcre_config()
#> UTF-8 Unicode properties     JIT    stack
#>  TRUE               TRUE    TRUE    FALSE
```

```
> extSoftVersion()
#>     zlib                 bzlib       xz                PCRE
#> "1.2.11"  "1.0.8, 13-Jul-2019"  "5.2.4"  "10.33 2019-04-16"
#>    ICU                        TRE        iconv  readline  BLAS
#> "58.2"  "TRE 0.8.0 R_fixes (BSD)"  "win_iconv"        ""    ""
```

Let me know if there's any more information I can provide to help replicate
and isolate the issue. Also, if this happens to be the expected behavior,
I'd be keen to learn about why that's the case.

Thank you,

-Carson

-- 
Carson Sievert, PhD
Software Engineer at RStudio
Website <https://cpsievert.me> | Twitter <https://twitter.com/cpsievert> |
GitHub <https://github.com/cpsievert>
