[R] strsplit("dia ma", "\\b") splits characterwise

Gabor Grothendieck ggrothendieck at gmail.com
Thu Jul 8 15:33:33 CEST 2010


On Thu, Jul 8, 2010 at 4:15 AM, Suharto Anggono Suharto Anggono
<suharto_anggono at yahoo.com> wrote:
> \b is word boundary.
> But, unexpectedly, strsplit("dia ma", "\\b") splits character by character.
>
>> strsplit("dia ma", "\\b")
> [[1]]
> [1] "d" "i" "a" " " "m" "a"
>
>> strsplit("dia ma", "\\b", perl=TRUE)
> [[1]]
> [1] "d" "i" "a" " " "m" "a"
>
>
> How can that be?
>
> This is the output of 'gregexpr'.
>
>> gregexpr("\\b", "dia ma")
> [[1]]
> [1] 1 2 3 4 5 6
> attr(,"match.length")
> [1] 0 0 0 0 0 0
>
>> gregexpr("\\b", "dia ma", perl=TRUE)
> [[1]]
> [1] 1 4 5 7
> attr(,"match.length")
> [1] 0 0 0 0
>
>
> The output from gregexpr("\\b", "dia ma", perl=TRUE) is what I expect. I expect 'strsplit' to split at those points.

You can use strapply in the gsubfn package to match all words and non-words:

library(gsubfn)
strapply("dia ma", "\\w+|\\W+", c)     # c("dia", " ", "ma")

or all spaces and non-spaces:

strapply("dia ma", "\\s+|\\S+", c)     # c("dia", " ", "ma")
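
If you would rather not load a package, the same idea (matching the pieces instead of splitting at the boundaries) can be sketched in base R with gregexpr and regmatches. This is only a minimal sketch and not part of the suggestion above; it assumes a version of R recent enough to provide regmatches, and the variable names x, m and pieces are just illustrative:

```r
# Base-R sketch: extract alternating runs of word and non-word characters
# rather than splitting at the zero-length \b matches.
x <- "dia ma"

# Locate every run of word characters or of non-word characters.
m <- gregexpr("\\w+|\\W+", x, perl = TRUE)

# regmatches() returns a list with one character vector per input string.
pieces <- regmatches(x, m)

pieces[[1]]
```

For this input, pieces[[1]] holds the three pieces "dia", " " and "ma". Swapping the pattern for "\\s+|\\S+" gives the spaces and non-spaces variant.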


