[Rd] Change in grep behavior from 1.9.0 to R-patched

Roger D. Peng rpeng at jhsph.edu
Fri Jun 11 17:54:21 CEST 2004


I have the following two environment variables set:

LANGVAR=en_US.UTF-8
LANG=C

I don't know exactly what both of these mean, but I always 
deliberately set LANG=C in my .tcshrc files since that is necessary to 
get Acrobat Reader working on my Red Hat system.  My guess is they 
were both set this way at build time.
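
To see what the running R session actually picks up (as opposed to 
what my shell sets), something like the following should work.  This 
is only a sketch: I am assuming that LC_CTYPE is the locale category 
relevant to character classes such as \w, and the variable names other 
than the two above are just the usual suspects, not ones I know to be 
set on my system.

```r
# Environment variables as seen by the R process ("" means unset)
env <- Sys.getenv(c("LANG", "LANGVAR", "LC_ALL", "LC_CTYPE"))
print(env)

# The locale R is currently using for character classification
# (assumption: this is the category that affects \w under perl = TRUE)
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)
```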

When I run Brian's two alternatives, I *always* get 84, no matter how 
many times I repeat them.  With \w+, however, the result varies from 
run to run: over, say, 1000 repetitions I sometimes get 13 and 
sometimes get 84.
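
Roughly, the repetition test looks like the sketch below.  The vector 
x here is a made-up stand-in, not my real names vector, and 
tabulate_counts is just a helper name I am using for illustration.

```r
# Stand-in data: NOT the real vector, just a few made-up names
x <- c("l1tmean", "l2tmean", "lag1tmean", "ltmean", "tmean", "l1rhum")

patterns <- c(word  = "^l\\w+tmean",
              range = "^l[A-Za-z0-9]+tmean",
              posix = "^l[[:alnum:]_]+tmean")

# Repeat the same match n times and tabulate the distinct counts seen
tabulate_counts <- function(pattern, x, n = 1000) {
    counts <- replicate(n, length(grep(pattern, x, perl = TRUE, value = TRUE)))
    table(counts)
}

results <- lapply(patterns, tabulate_counts, x = x)
print(results)
```

If matching is deterministic, each table should contain a single 
count.  With my real data on R-patched, it is the \w+ pattern that 
shows two different values.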

-roger

Prof Brian Ripley wrote:
> This is actually PCRE.  Something is wrong with your build of R-patched
> (1.9.1 alpha, I assume): I get 84 everywhere.  You are asking for a first
> character l, then one or more characters of `word' then tmean.  In your
> example this is the same as (in a suitable locale, including C)
> 
> length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))
> length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))
> 
> which each give 84.
> 
> One issue: PCRE is locale-dependent.  Did you use the same locale for 
> each?  What happens if you force LANG=C?
> 
> (I've just checked an R-devel Solaris system.  This gave 13 on a build 
> from Weds, and 84 when remade today.  The result with 13 seems truncated, 
> as they are the first 13.  Might be coincidental, of course.)
> 
> On Fri, 11 Jun 2004, Roger D. Peng wrote:
> 
> 
>>I've noticed a change in the way grep() behaves between the 1.9.0 
>>release and a recent R-patched.  On 1.9.0 I get the following output:
>>
>> > x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
>> > length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
>>[1] 84
>>
>>And on R-patched (2004-06-11) I get
>>
>> > x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
>> > length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
>>[1] 13
>>
>>I can't come up with a simpler example which is why I've posted my 
>>actual character vector on the web (please let me know if there are 
>>problems downloading it).
>>
>>I didn't find anything in the NEWS file that would indicate a change 
> 
> 
> No change is intended and the underlying C code is unchanged.
> 
> 
>>and another problem is that I'm not sure which behavior is correct. 
>>My knowledge of regular expressions is limited.
> 
>