[Rd] bug in strsplit?

Fri May 29 09:49:43 CEST 2009

src/main/character.c:435-438 (do_strsplit) contains the following code:

    for (i = 0; i < tlen; i++)
        if (getCharCE(STRING_ELT(tok, 0)) == CE_UTF8) use_UTF8 = TRUE;
    for (i = 0; i < len; i++)
        if (getCharCE(STRING_ELT(x, 0)) == CE_UTF8) use_UTF8 = TRUE;

neither loop body uses i, so each iteration repeats the same test on
element 0.  either the loops are redundant, or the fixed index '0' was
meant to be the variable i.  i guess it's the latter, hence 'bug?' in
the subject: as written, a UTF-8 element at any position other than 0
goes undetected.
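
to make the effect concrete, here is a minimal standalone sketch (not
the R source).  mock_ce_t, scan_fixed_index and scan_each_element are
hypothetical names, and a plain array of encoding tags stands in for
the string elements that getCharCE would inspect in the real code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for R's encoding tags; only two values are needed. */
typedef enum { MOCK_CE_NATIVE = 0, MOCK_CE_UTF8 = 1 } mock_ce_t;

/* Mirrors the loop as quoted from character.c: every iteration tests element 0. */
bool scan_fixed_index(const mock_ce_t *enc, size_t n)
{
    bool use_utf8 = false;
    for (size_t i = 0; i < n; i++)
        if (enc[0] == MOCK_CE_UTF8) use_utf8 = true;
    return use_utf8;
}

/* Mirrors the presumably intended loop: iteration i tests element i. */
bool scan_each_element(const mock_ce_t *enc, size_t n)
{
    bool use_utf8 = false;
    for (size_t i = 0; i < n; i++)
        if (enc[i] == MOCK_CE_UTF8) use_utf8 = true;
    return use_utf8;
}
```

for the tags { MOCK_CE_NATIVE, MOCK_CE_NATIVE, MOCK_CE_UTF8 },
scan_fixed_index returns false and scan_each_element returns true, so
the two versions disagree whenever the only UTF-8 element is not the
first one.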

it also appears that as soon as *any* element of tok (or x) passes the
test, use_UTF8 is set to TRUE, and nothing in these loops resets it;
further checks are then pointless.  the following rewrite stops at the
first match:

    for (i = 0; i < tlen; i++)
        if (getCharCE(STRING_ELT(tok, i)) == CE_UTF8) {
            use_UTF8 = TRUE;
            break;
        }
    for (i = 0; i < len; i++)
        if (getCharCE(STRING_ELT(x, i)) == CE_UTF8) {
            use_UTF8 = TRUE;
            break;
        }

since the pattern is repetitive, the following generic approach would
help (and the macro could possibly be reused in other places):

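    /* relies on an integer variable i being in scope at the call site;
       do { ... } while (0) makes each use behave as a single statement */
    #define CHECK_CE(CHARACTER, LENGTH, USEUTF8) \
        do { \
            for (i = 0; i < (LENGTH); i++) \
                if (getCharCE(STRING_ELT((CHARACTER), i)) == CE_UTF8) { \
                    (USEUTF8) = TRUE; \
                    break; \
                } \
        } while (0)

    CHECK_CE(tok, tlen, use_UTF8);
    CHECK_CE(x, len, use_UTF8);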

if you like it, i can provide a patch.

vQ