[Rd] \U or \L perl regex in gsub removes text outside capturing group in UTF-8 contexts

Hugh Parsonage hugh.parsonage at gmail.com
Mon Jun 19 13:50:52 CEST 2017


I write to clarify the status of \U and \L when used in the replacement
argument to gsub in R-devel (to become R 3.5.0). The behaviour of gsub
appears to have changed since R 3.4.0, but the documentation for the
replacement argument has not.


## Reprex (a call to readLines is essential; a URL is provided for
## convenience, but the behaviour should also reproduce with local files)


bib <- readLines(
  "https://raw.githubusercontent.com/HughParsonage/TeXCheckR/master/tests/testthat/lint_bib_in.bib",
  encoding = "UTF-8", n = 10
)
bib8910 <- bib[8:10]
gsub("(\\w+)", "\\U\\1", bib8910, perl = TRUE)


Actual result (in R-devel r72808):


#> [1] "@TECHREPORT" " AUTHOR" " TITLE"


Expected result (in R 3.4.0):


#> [1] "@TECHREPORT{WOODHUNTEROTOOLEETAL2012,"
#> [2] " AUTHOR = {TONY WOOD AND AMÉLIE HUNTER AND MICHAEL O'TOOLE AND PRASANA VENKATARAMAN AND LUCY CARTER},"
#> [3] " TITLE = {PUTTING THE CUSTOMER BACK IN FRONT: HOW TO MAKE ELECTRICITY CHEAPER},"
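
To see which elements of a vector are affected, one option is to compare
character counts before and after the substitution: \U should only change
case, so a shorter result suggests that text outside the capturing group was
dropped. The following is a minimal sketch under that assumption; check_upper
is a hypothetical helper name, not part of base R or of TeXCheckR.

```r
# Hypothetical helper (not part of the original report or of TeXCheckR).
# \U is expected to change case only, so the result should contain as many
# characters as the input; a shorter result means text was dropped.
check_upper <- function(x) {
  out <- gsub("(\\w+)", "\\U\\1", x, perl = TRUE)
  data.frame(
    encoding  = Encoding(x),                 # "unknown", "UTF-8", "latin1" or "bytes"
    n_in      = nchar(x, type = "chars"),    # characters before substitution
    n_out     = nchar(out, type = "chars"),  # characters after substitution
    truncated = nchar(out, type = "chars") < nchar(x, type = "chars"),
    stringsAsFactors = FALSE
  )
}
```

Calling check_upper(bib8910) on the vector from the reprex would then show,
per line, the declared encoding and whether the output came back shorter than
the input.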


## Likely point of breaking change
I was alerted on June 13 by Kurt Hornik that my package (TeXCheckR), which
had previously been accepted on CRAN, was now failing with an ERROR because
a unit test relies on \L.




## sessionInfo()

R Under development (unstable) (2017-06-19 r72808)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)


Matrix products: default


locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252


attached base packages:
[1] stats graphics grDevices utils datasets methods base


loaded via a namespace (and not attached):
[1] compiler_3.5.0






Many thanks,




Hugh Parsonage
Associate, Grattan Institute, Melbourne, AU
