[R] Do grep() and strsplit() use different regex engines?

Sun Jul 12 01:12:41 CEST 2015

omigosh -- you're right.

-- Bert
Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll

On Sat, Jul 11, 2015 at 3:31 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Jul 11, 2015, at 3:07 PM, Bert Gunter wrote:
>
>> David/Jeff:
>>
>> Thank you both.
>>
>> You seem to confirm that my observation of an "infelicity" in
>> strsplit() is real. That is most helpful.
>>
>> I found nothing surprising in the code in David's second message. That
>> is, the splits shown conform to what I would expect from "\\b", but
>> not to what I originally showed and David enlarged upon in his first
>> message. I still don't really get why a split should occur at every
>> letter.
>>
>> Jeff may very well have found the explanation, but I have not gone
>> through his code.
>>
>> If the infelicities noted (are there more?) by David and me are not
>> really bugs -- and I would be frankly surprised if they were -- I
>> would suggest that perhaps they deserve mention in the strsplit() man
>> page. Something to the effect that "\b and \< should not be used as
>> split characters...".
>
> It's more of a regex infelicity, or what appears (to us both, at a minimum) to be a violation of a 'least surprise principle':
>
>>  gsub("\\b", " ", "  This is a test case")
> [1] "     T h i s   i s   a   t e s t   c a s e "
>
>
> --
> David.
>
>>
>> Bert Gunter
>>
>> "Data is not information. Information is not knowledge. And knowledge
>> is certainly not wisdom."
>>   -- Clifford Stoll
>>
>>
>> On Sat, Jul 11, 2015 at 11:05 AM, David Winsemius
>> <dwinsemius at comcast.net> wrote:
>>>
>>> On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:
>>>
>>>> I noticed the following:
>>>>
>>>>> strsplit("red green","\\b")
>>>> [[1]]
>>>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>>>
>>> After reading the ?regex help page, I didn't understand why `\b` would split within sequences of "word"-characters, either. I expected this to be the result:
>>>
>>> [[1]]
>>> [1] "red"  " "  "green"
>>>
>>> There is a warning in that paragraph: "(The interpretation of ‘word’ depends on the locale and implementation.)"
>>>
>>> I got the expected result with only one of "\\>" and "\\<"
>>>
>>>> strsplit("red green","\\<")
>>> [[1]]
>>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>>>
>>>> strsplit("red green","\\>")
>>> [[1]]
>>> [1] "red"    " green"
>>>
>>> The result with "\\<" seems decidedly unexpected.
>>>
>>> I wondered whether the "original" regex documentation uses the same language as the R help page. So I went to the cited website and found:
>>> =======
>>> An assertion-character can be any of the following:
>>>
>>>        • < – Beginning of word
>>>        • > – End of word
>>>        • b – Word boundary
>>>        • B – Non-word boundary
>>>        • d – Digit character (equivalent to [[:digit:]])
>>>        • D – Non-digit character (equivalent to [^[:digit:]])
>>>        • s – Space character (equivalent to [[:space:]])
>>>        • S – Non-space character (equivalent to [^[:space:]])
>>>        • w – Word character (equivalent to [[:alnum:]_])
>>>        • W – Non-word character (equivalent to [^[:alnum:]_])
>>> ========
>>>
>>> The word "word" appears nowhere else on that page.
>>>
>>>
>>>>> strsplit("red green","\\W")
>>>> [[1]]
>>>> [1] "red"   "green"
>>>
>>> `\W` matches a single non-word character, so the " " character is consumed as the separator and discarded.
>>>
>>>>
>>>> I would have thought that "\\b" should give what "\\W" did. Note that:
>>>>
>>>>> grep("\\bred\\b","red green")
>>>> [1] 1
>>>> ## as expected
>>>>
>>>> Does strsplit use a different regex engine than grep()? Or more
>>>> likely, what am I misunderstanding?
>>>>
>>>> Thanks.
>>>>
>>>> Bert
>>>>
>>>>
>>>
>>>
>>> David Winsemius
>>> Alameda, CA, USA
>>>
>
> David Winsemius
> Alameda, CA, USA
>
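
A possible explanation for the per-letter splits, sketched in R under stated assumptions. The Details section of ?strsplit describes a loop that repeatedly finds the first match, emits the text to its left, and drops everything up to the end of the match. The helper below, split_like_strsplit(), is a hypothetical re-implementation of that loop for illustration only; it is not R's source code. In particular, the branch that peels off a single character when an empty match sits at the very start of the remaining string is an assumption, chosen because it would account for the output reported in this thread: "\\b" and "\\<" match with zero width at the start of every remainder that begins with a word character. The last lines show one way to get the tokens that were expected, by extracting runs of word and non-word characters instead of splitting on boundaries.

```r
# Hypothetical helper: mimics, for a single string, the loop described in
# the Details section of ?strsplit. Illustration only, not R's actual code.
split_like_strsplit <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x)
    start <- as.integer(m)
    if (start == -1L) {
      # No match left: emit the remainder and stop.
      out <- c(out, x)
      break
    }
    # Index of the last character consumed; equals start - 1 for an empty match.
    end <- start + attr(m, "match.length") - 1L
    if (end > 0L) {
      # Emit the text to the left of the match, then drop through the match end.
      out <- c(out, substr(x, 1, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # ASSUMPTION: an empty match at the very start splits off one character.
      out <- c(out, substr(x, 1, 1))
      x <- substr(x, 2, nchar(x))
    }
  }
  out
}

# Compare the sketch with the real function on the examples from this thread.
split_like_strsplit("red green", "\\b")
strsplit("red green", "\\b")[[1]]

# Alternative that avoids splitting on zero-width assertions altogether:
# extract runs of word characters and runs of non-word characters.
x <- "red green"
tokens <- regmatches(x, gregexpr("\\w+|\\W+", x))[[1]]
tokens
```

If the sketch and strsplit() agree on your installation, that supports the reading that the behaviour comes from how strsplit() consumes zero-length matches rather than from grep() and strsplit() using different regex engines.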