[Rd] Change in grep behavior from 1.9.0 to R-patched

Martin Maechler maechler at stat.math.ethz.ch
Fri Jun 11 17:21:43 CEST 2004


>>>>> "Roger" == Roger D Peng <rpeng at jhsph.edu>
>>>>>     on Fri, 11 Jun 2004 10:43:57 -0400 writes:

    Roger> I've noticed a change in the way grep() behaves between the 1.9.0 
    Roger> release and a recent R-patched.  On 1.9.0 I get the following output:

    >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
    >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
    Roger> [1] 84

    Roger> And on R-patched (2004-06-11) I get

    >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
    >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
    Roger> [1] 13

I can reproduce this exactly.

    <....>

    Roger> I didn't find anything in the NEWS file that would indicate a change 

Yes: the src/extras/pcre/ (Perl Compatible Regular Expressions)
     library was upgraded, and since we assumed that wouldn't
     have any effect (too optimistically, as we now see),
     it wasn't documented in NEWS.

    Roger> and another problem is that I'm not sure which behavior is correct. 
    Roger> My knowledge of regular expressions is limited.

The first one is correct, I think: '\w' matches word constituents
(see below), and with 1.9.0
you get

 > grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
  [1] "l1pm10tmean"  "l1pm25tmean"  "l1cotmean"    "l1no2tmean"   "l1so2tmean"  
  [6] "l1o3tmean"    "l2pm10tmean"  "l2pm25tmean"  "l2cotmean"    "l2no2tmean"  
 [11] "l2so2tmean"   "l2o3tmean"    "l3pm10tmean"  "l3pm25tmean"  "l3cotmean"   
 [16] "l3no2tmean"   "l3so2tmean"   "l3o3tmean"    "l4pm10tmean"  "l4pm25tmean" 
 [21] "l4cotmean"    "l4no2tmean"   "l4so2tmean"   "l4o3tmean"    "l5pm10tmean" 
 [26] "l5pm25tmean"  "l5cotmean"    "l5no2tmean"   "l5so2tmean"   "l5o3tmean"   
 [31] "l6pm10tmean"  "l6pm25tmean"  "l6cotmean"    "l6no2tmean"   "l6so2tmean"  
 [36] "l6o3tmean"    "l7pm10tmean"  "l7pm25tmean"  "l7cotmean"    "l7no2tmean"  
 [41] "l7so2tmean"   "l7o3tmean"    "lm1pm10tmean" "lm1pm25tmean" "lm1cotmean"  
 [46] "lm1no2tmean"  "lm1so2tmean"  "lm1o3tmean"   "lm2pm10tmean" "lm2pm25tmean"
 [51] "lm2cotmean"   "lm2no2tmean"  "lm2so2tmean"  "lm2o3tmean"   "lm3pm10tmean"
 [56] "lm3pm25tmean" "lm3cotmean"   "lm3no2tmean"  "lm3so2tmean"  "lm3o3tmean"  
 [61] "lm4pm10tmean" "lm4pm25tmean" "lm4cotmean"   "lm4no2tmean"  "lm4so2tmean" 
 [66] "lm4o3tmean"   "lm5pm10tmean" "lm5pm25tmean" "lm5cotmean"   "lm5no2tmean" 
 [71] "lm5so2tmean"  "lm5o3tmean"   "lm6pm10tmean" "lm6pm25tmean" "lm6cotmean"  
 [76] "lm6no2tmean"  "lm6so2tmean"  "lm6o3tmean"   "lm7pm10tmean" "lm7pm25tmean"
 [81] "lm7cotmean"   "lm7no2tmean"  "lm7so2tmean"  "lm7o3tmean"  
 > 

which is correct AFAICS and shouldn't be shortened to only the 13 elements

> grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
 [1] "l1pm10tmean" "l1pm25tmean" "l1cotmean"   "l1no2tmean"  "l1so2tmean" 
 [6] "l1o3tmean"   "l2pm10tmean" "l2pm25tmean" "l2cotmean"   "l2no2tmean" 
[11] "l2so2tmean"  "l2o3tmean"   "l3pm10tmean"

in R-patched.
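
A minimal sketch for repeating this check without the remote file.  It
assumes the 84 names follow the scheme visible in the listing above
(lags l1..l7 and lm1..lm7, six pollutants, suffix "tmean"); the real
names.R may well contain further names, so the vector x built here is
only a stand-in.  With a grep() that behaves as in 1.9.0, the length
should be 84 and the setdiff() should be empty.

```r
# Stand-in for the vector from names.R (an assumption: the real file
# may hold additional names that do not match the pattern).
lags <- c(paste("l", 1:7, sep = ""), paste("lm", 1:7, sep = ""))
pollutants <- c("pm10", "pm25", "co", "no2", "so2", "o3")
x <- paste(rep(lags, each = length(pollutants)), pollutants, "tmean",
           sep = "")

# Same call as in the thread.
hits <- grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
length(hits)

# Names the pattern should match but did not.
setdiff(x, hits)
```

Running the same lines on both R versions and comparing the setdiff()
output would show exactly which names are being dropped.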

------------

For me,  'man perlre' contains

>>         \w  Match a "word" character (alphanumeric plus "_")

         <......>

>>     A "\w" matches a single alphanumeric character or "_", not a whole
>>     word.  Use "\w+" to match a string of Perl-identifier characters (which
>>     isn't the same as matching an English word).  If "use locale" is in
>>     effect, the list of alphabetic characters generated by "\w" is taken
>>     from the current locale.  See the perllocale manpage. .......

so it may well be connected to locale problems.  But I don't
think any locale should have
 "l2pm25tmean" matched by  '^l\w+tmean'   while
 "lm5pm25tmean" is not.

[If anything were to distinguish these two, it should rather be
 the other way round].
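
One way to separate a '\w' problem from a more general matching problem
is to spell the character class out by hand.  The sketch below is only
an illustration: the explicit class [A-Za-z0-9_] is a substitute for
'\w' (following the perlre description quoted above), not something
used in the original report.

```r
# One name that R-patched still returned and one that it dropped.
nms <- c("l2pm25tmean", "lm5pm25tmean")

# \w as the PCRE word-character class.
perl_hits <- grepl("^l\\w+tmean", nms, perl = TRUE)

# Cross-check that does not rely on \w: explicit ASCII class
# (alphanumeric plus "_").
explicit_hits <- grepl("^l[A-Za-z0-9_]+tmean", nms, perl = TRUE)

print(perl_hits)
print(explicit_hits)
```

If the two results differ, the handling of '\w' (possibly its locale
tables) is the likely culprit; if both drop "lm5pm25tmean", the problem
lies elsewhere in the matching.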

Martin Maechler


