[R] Do grep() and strsplit() use different regex engines?

Sat Jul 11 20:05:12 CEST 2015

On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:

> I noticed the following:
> 
>> strsplit("red green","\\b")
> [[1]]
> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"

Even after reading the ?regex help page, I don't understand why `\b` would split within a run of "word" characters either. I expected this result:

[[1]]
[1] "red"  " "  "green"

The paragraph of ?regex that describes `\b` does carry a warning: "(The interpretation of ‘word’ depends on the locale and implementation.)"

I got a split at a word edge, as expected, with only one of "\\<" and "\\>":

> strsplit("red green","\\<")
[[1]]
[1] "r" "e" "d" " " "g" "r" "e" "e" "n"

> strsplit("red green","\\>")
[[1]]
[1] "red"    " green"

The result with "\\<" seems decidedly unexpected.
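
One possible explanation lies in the loop described in the Details section of ?strsplit (find a match, emit the text to its left, drop the match and everything before it, repeat) rather than in the regex engine. Below is a rough sketch of that loop. The name split_sketch is hypothetical, and the branch for an empty match at the very start of the remaining string (emit one character and move on) is my assumption about how strsplit() behaves, not something I found stated on the help page.

```r
# Hypothetical helper (not part of R): a rough sketch of the loop described
# in ?strsplit, for one string and one regular expression.
split_sketch <- function(x, pattern) {
  out <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x)
    start <- as.integer(m)
    if (start == -1L) {
      # No match: whatever is left is the final token
      out <- c(out, x)
      break
    }
    end <- start + attr(m, "match.length") - 1L  # last position covered by the match
    if (end >= 1L) {
      # The match ends past the start: emit the text to its left, then
      # drop that text together with the match
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # ASSUMPTION: an empty match at the very start yields a
      # one-character token and the search restarts one character later
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

print(split_sketch("red green", "\\W"))  # pattern that consumes a character
print(split_sketch("red green", "\\>"))  # zero-width pattern
print(split_sketch("red green", "\\<"))  # zero-width pattern
```

If the assumption holds, a zero-width pattern that matches at the start of each remaining piece would produce one-character tokens, which may be what happens with "\\<" and "\\b". Comparing the printed results against strsplit() itself is the way to check.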

I wondered whether the "original" regex documentation uses the same language as the R help page, so I went to the cited website and found:
=======
An assertion-character can be any of the following:

	• < – Beginning of word
	• > – End of word
	• b – Word boundary
	• B – Non-word boundary
	• d – Digit character (equivalent to [[:digit:]])
	• D – Non-digit character (equivalent to [^[:digit:]])
	• s – Space character (equivalent to [[:space:]])
	• S – Non-space character (equivalent to [^[:space:]])
	• w – Word character (equivalent to [[:alnum:]_])
	• W – Non-word character (equivalent to [^[:alnum:]_])
========

The word "word" appears nowhere else on that page.

>> strsplit("red green","\\W")
> [[1]]
> [1] "red"   "green"

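`\W` matches an actual non-word character (it consumes one character), whereas `\b` is a zero-width assertion. strsplit() discards whatever the split pattern matches, so the " " character is dropped from the result.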
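
If the goal is to keep the separator as its own piece, one workaround is to avoid splitting altogether and extract the pieces instead. This is only a sketch of an alternative, not an explanation of the strsplit() behaviour: it pulls out alternating runs of word and non-word characters with gregexpr() and regmatches().

```r
x <- "red green"

# Match either a run of word characters or a run of non-word characters,
# then extract every match in order of appearance.
tokens <- regmatches(x, gregexpr("\\w+|\\W+", x))[[1]]
print(tokens)
# [1] "red"   " "     "green"
```

Because every match here consumes at least one character, the result does not depend on how zero-width matches are handled.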

> 
> I would have thought that "\\b" should give what "\\W" did. Note that:
> 
>> grep("\\bred\\b","red green")
> [1] 1
> ## as expected
> 
> Does strsplit use a different regex engine than grep()? Or more
> likely, what am I misunderstanding?
> 
> Thanks.
> 
> Bert
> 
> 

David Winsemius
Alameda, CA, USA