[R] Do grep() and strsplit() use different regex engines?

Bert Gunter bgunter.4567 at gmail.com
Sun Jul 12 04:09:11 CEST 2015


Thanks, Chuck (he says, red-faced).

Maybe I should read the man page more carefully ...!

As for grep() and friends, similar issues apply (from ?grep):

"POSIX 1003.2 mode of gsub and gregexpr does not work correctly with
repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for
such matches (but that may not work as expected with non-ASCII inputs,
as the meaning of ‘word’ is system-dependent)."
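
A minimal sketch of what that note is about. The sample string and the "|" marker are my own choices for illustration, not taken from the man page. It only puts the default engine and perl = TRUE side by side so the two results can be compared:

```r
# Sample input and marker are assumptions made for illustration.
x <- "red green"

# Default (POSIX 1003.2 / TRE) engine: ?grep warns that repeated
# word-boundary matches are not handled correctly here.
default_result <- gsub("\\b", "|", x)

# PCRE engine, as ?grep recommends for such patterns.
perl_result <- gsub("\\b", "|", x, perl = TRUE)

print(default_result)
print(perl_result)
```

With perl = TRUE the marker should land only at the word boundaries. As the note says, that may still not hold for non-ASCII input.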

And no, I don't think anything needs to be added to ?strsplit. The man
page writers spelled it out clearly. They're not responsible for my
dummheit.
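
For the record, a small sketch of the behavior that ?strsplit documents. The input string is an assumption chosen for illustration. A zero-length split is documented to break x into single characters, and a pattern that can only match emptily, such as "\\b", falls under the same "empty matches" clause:

```r
# Sample input is an assumption made for illustration.
x <- "red green"

# Zero-length split: documented to split x into single characters.
chars <- strsplit(x, "")[[1]]
print(chars)

# "\\b" never consumes a character, so every match it makes is empty.
boundary <- strsplit(x, "\\b")[[1]]
print(boundary)

# A split pattern that consumes characters behaves as usual.
words <- strsplit(x, " ")[[1]]
print(words)
```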

My apologies to all for wasted bandwidth...


Cheers,
Bert

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll


On Sat, Jul 11, 2015 at 4:26 PM, Charles C. Berry <ccberry at ucsd.edu> wrote:
> On Sat, 11 Jul 2015, Bert Gunter wrote:
>
>> David/Jeff:
>>
>> Thank you both.
>>
>> You seem to confirm that my observation of an "infelicity" in
>> strsplit() is real. That is most helpful.
>>
>> I found nothing surprising in the code in David's second message. That
>> is, the splits shown conform to what I would expect from "\\b", but
>> not to what I originally showed and David enlarged upon in his first
>> message. I still don't really get why a split should occur at every
>> letter.
>>
>> Jeff may very well have found the explanation, but I have not gone
>> through his code.
>>
>> If the infelicities noted (are there more?) by David and me are not
>> really bugs -- and I would be frankly surprised if they were -- I
>> would suggest that perhaps they deserve mention in the strsplit() man
>> page. Something to the effect that "\b and \< should not be used as
>> split characters..." .
>
>
> Bert et al,
>
> ?strsplit already says:
>
> "If empty matches occur, in particular if split has length 0, x is split
> into single characters."
>
> And there are various ways that empty matches can happen besides using "\\b"
> as the split arg. But there would be no harm in adding your cases to 'in
> particular ...'
>
> The comment in the code (src/main/grep.c: line 493) suggests this was a
> deliberate decision. However, similar functions in other languages do not do
> this.
>
> For example, emacs `(split-string "red green" "\\b")' gives
>
>         ("" "red" " " "green" "")
>
> as the result.
>
> Chuck


