[Rd] strsplit and the empty string

Christian Brechbühler brechbuehler at gmail.com
Wed Jun 18 16:59:39 CEST 2008


On Wed, Jun 18, 2008 at 8:45 AM, Wacek Kusnierczyk
<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> asked for opinions:
>
> When the pattern
> matches the beginning of the search string, the empty string is added to
> the result, but that's not the case when the pattern matches the end of
> the search string:
>
> strsplit(" hello dolly ")
> [1] "" "hello" "dolly"

With R version 2.6.1 Patched (2007-11-26 r43541), I get
    Error in strsplit(" hello dolly ") :
      argument "split" is missing, with no default

But strsplit(" hello dolly ", " ") reproduces your results.

> The man for strsplit explains the algorithm:
>
> "
>  The algorithm applied to each input string is
>
>
>         repeat {
>             if the string is empty
>                 break.
>             if there is a match
>                 add the string to the left of the match to the output.
>                 remove the match and all to the left of it.
>             else
>                 add the string to the output.
>                 break.
>         }
>
>     Note that this means that if there is a match at the beginning of
>     a (non-empty) string, the first element of the output is '""', but
>     if there is a match at the end of the string, the output is the
>     same as with the match removed.
> "

The algorithm, the comment after it, and your results are consistent.
Whether it is intuitive is a matter of taste.  I agree it's not as
symmetric as one might like.
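
A quick check of the note quoted above, as a minimal sketch: the variable
names are only for illustration, and the split string is passed explicitly
as " ".

```r
# A trailing match contributes nothing: same result as with the match removed.
trailing <- strsplit("hello dolly ", " ")[[1]]
plain    <- strsplit("hello dolly", " ")[[1]]
print(identical(trailing, plain))   # TRUE

# A leading match, by contrast, contributes an empty first element.
leading <- strsplit(" hello dolly", " ")[[1]]
print(length(leading))              # 3: "", "hello", "dolly"
```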

> If the pattern matches, (second if above), the match is added to the
> output, and removed from the input -- which after this step is the empty
> string;

Close.  It is the string to the left of the match, "dolly", that is added
to the output, not the match itself.  I agree that the input is now the
empty string.

> in the next step, there is no match (else above), so the rest of
> the input string (= the empty string) *should* be added, but it is not
> what happens.

No, in the next step, the string is empty (first 'if' above), and we break.
The else branch never applies in your example.
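
To make the walk-through concrete, here is a rough transcription of the
documented loop into R.  This is only a sketch: split_sketch is a name I
made up, it is not how strsplit is implemented internally, and it assumes
a non-empty, literal (non-regex) split string.

```r
# Hypothetical helper mirroring the pseudocode from the help page.
# Assumes `sep` is a non-empty literal string (matched with fixed = TRUE).
split_sketch <- function(s, sep) {
  out <- character(0)
  repeat {
    if (nchar(s) == 0) break                       # string is empty: stop
    pos <- as.integer(regexpr(sep, s, fixed = TRUE))  # first match, or -1
    if (pos > 0) {
      out <- c(out, substr(s, 1, pos - 1))         # add text left of the match
      s <- substr(s, pos + nchar(sep), nchar(s))   # drop match and all to its left
    } else {
      out <- c(out, s)                             # no match: add the remainder
      break
    }
  }
  out
}

result <- split_sketch(" hello dolly ", " ")
print(result)   # the three elements "", "hello", "dolly"
```

After "dolly" is added, the remaining input is empty, so the loop exits
through the first test and never reaches the else branch.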

> (i see no good
> reason for including the empty string at the beginning but not at the
> end of the output; no other language i know would do that this way)

I checked Perl, and it does exactly the same:
  print join "==", split / /, " hello dolly "
==hello==dolly
(that's 3 elements: "", "hello", and "dolly").
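
For a side-by-side comparison, a small R counterpart to the Perl
one-liner (my own illustration; the split string " " is given explicitly):

```r
# Split on a single space, then join the pieces with "==" as in the Perl line.
parts  <- strsplit(" hello dolly ", " ")[[1]]
joined <- paste(parts, collapse = "==")
cat(joined, "\n")   # ==hello==dolly
```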

Cheers,
/Christian


