[R] Do grep() and strsplit() use different regex engines?

Sun Jul 12 00:07:19 CEST 2015

David/Jeff:

Thank you both.

You seem to confirm that my observation of an "infelicity" in
strsplit() is real. That is most helpful.

I found nothing surprising in the code from David's second message.
That is, the splits shown there conform to what I would expect from
"\\b", unlike what I originally showed and what David enlarged upon in
his first message. I still don't really get why a split should occur
at every letter.
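
One possible reading, offered as a sketch and not as a statement of
how R is implemented: ?strsplit says that if empty matches occur, x is
split into single characters, and it describes a loop that repeatedly
matches, emits the text to the left of the match, and restarts on the
remainder. Since "\\b" and "\\<" match the empty string at the very
start of a word, each restart would find an empty match at position 1
and peel off one character. The helper split_sketch() below is
hypothetical (it is not part of R) and only imitates that loop with
regexpr(); the rule "take one character when the match ends at the
start of the string" is my assumption about what the documentation
means.

```r
# Hypothetical helper imitating the loop described in ?strsplit.
# Assumption: an empty match at the start of the remaining string
# yields a single-character token; any other match yields the text
# to its left and is then removed together with that text.
split_sketch <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x)
    start <- as.integer(m)             # 1-based start, -1 if no match
    len <- attr(m, "match.length")     # 0 for an empty match
    if (start == -1) {
      out <- c(out, x)                 # no match: the rest is one token
      break
    }
    end <- start + len - 1             # index of the last matched character
    if (end > 0) {
      out <- c(out, substr(x, 1, start - 1))
      x <- substr(x, end + 1, nchar(x))
    } else {
      out <- c(out, substr(x, 1, 1))   # empty match at the very start
      x <- substr(x, 2, nchar(x))
    }
  }
  out
}

split_sketch("red green", "\\b")
split_sketch("red green", "\\<")
split_sketch("red green", "\\>")
split_sketch("red green", "\\W")
```

If the sketch is faithful, these four calls should reproduce the
strsplit() outputs quoted in this thread. "\\>" would escape the
letter-by-letter behaviour only because its first empty match falls
after "red" and not at position 1.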

Jeff may very well have found the explanation, but I have not gone
through his code.

If the infelicities noted (are there more?) by David and me are not
really bugs -- and I would be frankly surprised if they were -- I
would suggest that perhaps they deserve mention in the strsplit() man
page. Something to the effect that "\b and \< should not be used as
split characters..." .
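
For anyone who wants the word-boundary pieces in the meantime, one
workaround (a sketch, not something proposed elsewhere in this thread)
is to extract the tokens instead of splitting on the boundaries.
regmatches() and gregexpr() are base R; the variable names and the
pattern are only illustrative.

```r
x <- "red green"

# Alternating runs of word and non-word characters, separators kept
tokens <- regmatches(x, gregexpr("\\w+|\\W+", x))[[1]]
tokens

# Runs of word characters only, separators dropped
words <- regmatches(x, gregexpr("\\w+", x))[[1]]
words
```

As with "\\b", what counts as a word character here may depend on the
locale and the regex implementation.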

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll

On Sat, Jul 11, 2015 at 11:05 AM, David Winsemius
<dwinsemius at comcast.net> wrote:
>
> On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:
>
>> I noticed the following:
>>
>>> strsplit("red green","\\b")
>> [[1]]
>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>
> After reading the ?regex help page, I didn't understand why `\b` would split within sequences of "word"-characters, either. I expected this to be the result:
>
> [[1]]
> [1] "red"  " "  "green"
>
> There is a warning in that paragraph: "(The interpretation of ‘word’ depends on the locale and implementation.)"
>
> I got the expected result with only one of "\\>" and "\\<":
>
>> strsplit("red green","\\<")
> [[1]]
> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>
>> strsplit("red green","\\>")
> [[1]]
> [1] "red"    " green"
>
> The result with "\\<" seems decidedly unexpected.
>
> I wondered whether the "original" regex documentation uses the same language as the R help page. So I went to the cited website and found:
> =======
> An assertion-character can be any of the following:
>
>         • < – Beginning of word
>         • > – End of word
>         • b – Word boundary
>         • B – Non-word boundary
>         • d – Digit character (equivalent to [[:digit:]])
>         • D – Non-digit character (equivalent to [^[:digit:]])
>         • s – Space character (equivalent to [[:space:]])
>         • S – Non-space character (equivalent to [^[:space:]])
>         • w – Word character (equivalent to [[:alnum:]_])
>         • W – Non-word character (equivalent to [^[:alnum:]_])
> ========
>
> The word "word" appears nowhere else on that page.
>
>
>>> strsplit("red green","\\W")
>> [[1]]
>> [1] "red"   "green"
>
> `\W` matches a single non-word character, so the " " character is consumed as the separator and discarded.
>
>>
>> I would have thought that "\\b" should give what "\\W" did. Note that:
>>
>>> grep("\\bred\\b","red green")
>> [1] 1
>> ## as expected
>>
>> Does strsplit use a different regex engine than grep()? Or more
>> likely, what am I misunderstanding?
>>
>> Thanks.
>>
>> Bert
>>
>>
>
>
> David Winsemius
> Alameda, CA, USA
>