[Rd] Change in grep behavior from 1.9.0 to R-patched

Marc Schwartz MSchwartz at MedAnalytics.com
Fri Jun 11 17:46:37 CEST 2004


On Fri, 2004-06-11 at 10:28, Prof Brian Ripley wrote:
> This is actually PCRE.  Something is wrong with your build of R-patched
> (1.9.1 alpha, I assume): I get 84 everywhere.  You are asking for a first
> character l, then one or more characters of `word' then tmean.  In your
> example this is the same as (in a suitable locale, including C)
> 
> length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))
> length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))
> 
> which each give 84.
> 
> One issue: PCRE is locale-dependent.  Did you use the same locale for 
> each?  What happens if you force LANG=C?
> 
> (I've just checked an R-devel Solaris system.  This gave 13 on a build 
> from Weds, and 84 when remade today.  The result with 13 seems truncated, 
> as they are the first 13.  Might be coincidental, of course.)


I can confirm the above using R Version 1.9.1 alpha (2004-06-10) on FC2:

> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
> length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))
[1] 84
> length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))
[1] 84
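
For anyone who wants to try the comparison without fetching names.R, here is a minimal sketch on a made-up vector. The vector x_demo and its element names are hypothetical and only loosely imitate the real names. The three patterns should agree on it because none of the elements contains an underscore, which \w and [[:alnum:]_] accept but [A-Za-z0-9] does not.

```r
# Hypothetical stand-in for the names in names.R (NOT the real data).
x_demo <- c("l1pm10tmean", "l2o3tmean", "ltmean", "pm10tmean", "l1pm10tmin")

# The pattern from the original report and the two explicit equivalents.
hits_w     <- grep("^l\\w+tmean",          x_demo, perl = TRUE, value = TRUE)
hits_range <- grep("^l[A-Za-z0-9]+tmean",  x_demo, perl = TRUE, value = TRUE)
hits_class <- grep("^l[[:alnum:]_]+tmean", x_demo, perl = TRUE, value = TRUE)

# On a correctly working build all three should select the same elements.
identical(hits_w, hits_range)
identical(hits_w, hits_class)
```

"ltmean" is left out by all three patterns because at least one character is required between the leading l and tmean.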


Also, to demonstrate Roger's follow-up example, where the \w version
gives inconsistent counts across repeated calls:

> d <- replicate(1000, length(grep("^l\\w+tmean", x, perl = TRUE,
+                                  value = TRUE)))
> summary(d)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.00   13.00   13.00   14.14   13.00   84.00 
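
A minimal sketch of the same repeated-call check with the character-type locale forced to C, along the lines of the LANG=C question quoted above. It runs on a made-up vector (x_demo is hypothetical, not the real names) and uses Sys.setlocale() inside R as an assumed stand-in for setting LANG=C in the environment. On a correctly working build every call should return the same count, so table() should show a single value.

```r
# Hypothetical stand-in for the names in names.R (NOT the real data).
x_demo <- c("l1pm10tmean", "l2o3tmean", "ltmean", "pm10tmean", "l1pm10tmin")

# Remember the current character-type locale, then force C for the test.
old_ctype <- Sys.getlocale("LC_CTYPE")
invisible(Sys.setlocale("LC_CTYPE", "C"))

# Repeat the identical call many times and collect the match counts.
d_demo <- replicate(1000,
                    length(grep("^l\\w+tmean", x_demo, perl = TRUE,
                                value = TRUE)))

# Restore the original locale before looking at the results.
invisible(Sys.setlocale("LC_CTYPE", old_ctype))

# More than one distinct count would indicate non-deterministic matching.
table(d_demo)
```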


BTW, the PCRE package here is pcre-4.5-2.

HTH,

Marc Schwartz


