[Rd] Unicode whitespace

hadley wickham h.wickham at gmail.com
Sun Jan 6 06:32:05 CET 2008


On Jan 5, 2008 1:40 AM, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
> I presume you want this only in a UTF-8 locale?

Yes, although my assumption is that this will become an increasingly
common locale as time goes by.

> Currently this is done by
>
> static int SkipSpace(void)
> {
>      int c;
>      while ((c = xxgetc()) == ' ' || c == '\t' || c == '\f')
>         /* nothing */;
>      return c;
> }
>
> in gram.c.  We could make use of isspace and its wide-char equivalent
> iswspace.  However:
>
>
> - there is the perennial debate over whether \v is whitespace.
>
> R-lang says
>
>    Although not strictly tokens, stretches of whitespace characters
>    (spaces and tabs) serve to delimit tokens in case of ambiguity,
>
> which suggests it has a minimal view of whitespace.
>
>
> - iswspace is often rather unreliable.  E.g. glibc says
>
>      The wide character class "space" always contains  at  least  the  space
>      character and the control characters '\f', '\n', '\r', '\t', '\v'.
>
> and I think it usually does not contain other forms of spaces.  More
> seriously
>
>      The  behaviour  of  iswspace()  depends on the LC_CTYPE category of the
>      current locale.
>
> so what is a space will depend on the encoding (hence my question about
> UTF-8).  And Ei-ji Nakama has replaced iswspace on MacOS X, because
> apparently it is wrongly implemented.
>
>
> - it would complicate the parser as look-ahead would be needed (you would
> need to read the next mbcs, check if it were whitespace and pushback if
> needed).  We do that elsewhere, though.

I had assumed the parser would already be Unicode/multibyte aware, so
this would be an easy fix.  The locale issues are clearly important and
can't easily be swept under the rug.
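
For concreteness, here is a rough sketch of the look-ahead approach
described above: decode the next multibyte character, test it with
iswspace, and leave it unread if it is not whitespace.  This is not R's
parser code.  skip_wspace is a hypothetical helper that works on a
string rather than on xxgetc(), and I have assumed newline should stay
unskipped, as it is in the current SkipSpace.  What counts as a space
still depends on LC_CTYPE, which is exactly the uncertainty raised
above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: return a pointer to the first character of s
   that is not whitespace according to iswspace() in the current
   LC_CTYPE locale.  Newline is deliberately not skipped. */
static const char *skip_wspace(const char *s)
{
    assert(s != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    size_t left = strlen(s);

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);

        /* Invalid or incomplete sequence: stop and let the caller cope. */
        if (n == (size_t)-1 || n == (size_t)-2 || n == 0)
            break;

        /* Not whitespace (or a newline): "push back" by not advancing. */
        if (wc == L'\n' || !iswspace((wint_t)wc))
            break;

        s += n;
        left -= n;
    }
    return s;
}
```

Because the decoding goes through mbrtowc, the caller would need to have
called setlocale(LC_CTYPE, "") for anything beyond ASCII to be
recognised, and the result can differ between platforms.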

> The only one of these 'spaces' I have much sympathy for is NBSP (which is
> also fairly easy to generate in CP1252).  It would be easy to add that.
> Otherwise I am not convinced it is worth the work (and added uncertainty).
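
To illustrate how small the NBSP-only change might be, here is a
minimal sketch, assuming UTF-8 input where NBSP (U+00A0) is the byte
pair 0xC2 0xA0.  The Input type and the in_getc/in_ungetc helpers are
made-up stand-ins for the parser's xxgetc()/xxungetc(), so this is not
the real gram.c code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the parser's input stream: a
   NUL-terminated byte buffer with a read position. */
typedef struct {
    const unsigned char *buf;
    size_t pos;
} Input;

/* Return the next byte, or -1 at end of input (without advancing). */
static int in_getc(Input *in)
{
    unsigned char c = in->buf[in->pos];
    if (c == '\0')
        return -1;
    in->pos++;
    return c;
}

/* Push back the byte most recently read. */
static void in_ungetc(Input *in)
{
    assert(in->pos > 0);
    in->pos--;
}

/* Skip space, tab, form feed and UTF-8 encoded NBSP.
   Returns the first byte not skipped, or -1 at end of input. */
static int skip_space_nbsp(Input *in)
{
    for (;;) {
        int c = in_getc(in);
        if (c == ' ' || c == '\t' || c == '\f')
            continue;
        if (c == 0xC2) {
            int d = in_getc(in);        /* one byte of look-ahead */
            if (d == 0xA0)
                continue;               /* 0xC2 0xA0 is U+00A0 */
            if (d != -1)
                in_ungetc(in);          /* not NBSP: push it back */
        }
        return c;
    }
}
```

In a single-byte encoding such as CP1252, NBSP is the lone byte 0xA0, so
the test there would be a plain comparison with no look-ahead at all.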

That's reasonable.  Another related request would be treating curly
quotes (single and double) the same way as normal quotes, but I'd
imagine similar caveats would apply there.  You could also imagine
using Unicode arrows in place of <- and ->, but that's probably
heading too far down the APL/Fortress road!

Hadley

-- 
http://had.co.nz/


