[Rd] RFC: hexadecimal constants and decimal points

Prof Brian Ripley ripley at stats.ox.ac.uk
Sun Apr 17 13:38:10 CEST 2005


These are some points stimulated by reading about C history (and 
related in their implementation).


1) On some platforms

> as.integer("0xA")
[1] 10

but not on all (neither Solaris nor Windows accepts it).  We do not define what is 
allowed, and rely on the OS's implementation of strtod (yes, not strtol). 
It seems that glibc does allow hex: C99 mandates it but C89 seems not to 
allow it.

I think that was a mistake, and strtol should have been used.  With strtol
(and base 0), C89 does mandate the handling of hex constants and also octal
ones.  So changing to strtol would change the meaning of as.integer("011")
(it would be read as octal, i.e. 9 rather than 11).
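
To make the difference concrete, here is a minimal C sketch of how the
base argument of strtol changes the reading of "011" and "0xA".  The
helper name to_long is purely illustrative and is not anything in the R
sources; the base-0 and base-10 behaviour shown is what the C standard
specifies for strtol.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative helper (not R code): convert the whole string with strtol
   in the given base.  Returns 0 on success, -1 if nothing was converted
   or if unconverted characters remain. */
static int to_long(const char *s, int base, long *out)
{
    char *end;

    assert(s != NULL && out != NULL);
    *out = strtol(s, &end, base);
    return (end == s || *end != '\0') ? -1 : 0;
}
```

With base 0, "011" is taken as octal (9) and "0xA" as hex (10); with
base 10, "011" is 11 and "0xA" is rejected because conversion stops at
the "x".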

Proposal: we handle this ourselves and define what values are acceptable,
namely for as.integer:

[+|-][0-9]+
NA
0[x|X][0-9A-Fa-f]+

in all cases such that the converted value is in-range.  (This does mean 
as.integer("1e+05") would be invalid, but is it clear that is allowed 
now?)
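
A rough sketch of what "handling this ourselves" might look like in C
follows.  Everything here is hypothetical: the function name
parse_r_integer, the return codes, and the choice to treat the valid
range as -INT_MAX..INT_MAX (on the assumption that INT_MIN is kept for
NA) are mine for illustration.  It reads the sign as optional, does not
skip whitespace, and tests digits by explicit character ranges so the
result does not depend on the locale.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

enum { CONV_OK = 0, CONV_NA = 1, CONV_INVALID = -1 };

/* Value of an ASCII hex digit, or -1 if c is not one. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Accepts  [+|-][0-9]+  |  NA  |  0[x|X][0-9A-Fa-f]+  and nothing else.
   Stores the value in *out and returns CONV_OK, returns CONV_NA for
   "NA", or CONV_INVALID for malformed or out-of-range input. */
static int parse_r_integer(const char *s, int *out)
{
    long long acc = 0;
    int base = 10, neg = 0, d;

    assert(s != NULL && out != NULL);
    if (strcmp(s, "NA") == 0) return CONV_NA;

    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        base = 16;                      /* hex form takes no sign */
        s += 2;
    } else if (*s == '+' || *s == '-') {
        neg = (*s == '-');
        s++;
    }
    if (*s == '\0') return CONV_INVALID;    /* at least one digit needed */

    for (; *s != '\0'; s++) {
        d = hex_digit(*s);
        if (d < 0 || d >= base) return CONV_INVALID;
        acc = acc * base + d;
        if (acc > INT_MAX) return CONV_INVALID;     /* out of range */
    }
    *out = (int)(neg ? -acc : acc);
    return CONV_OK;
}
```

Under these rules "011" stays decimal 11, "0xA" is 10 on every platform,
and "1e+05" is rejected.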

For as.numeric(), probably the C99 rules (which include NaN, Inf, 
Infinity, and we need to add NA).

Alternatively, make and document the semantics to be
as.integer(as.numeric(char_string)) (which is effectively what we have 
now, although it causes surprises).
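
For comparison, the alternative semantics could be sketched as below: go
through strtod and then truncate.  The name integer_via_double and the
range check are assumptions for illustration only, and whether hex input
is accepted here still depends on the platform's strtod, which is the
portability problem described above.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical sketch of as.integer(as.numeric(s)): parse as double,
   reject anything that would not fit an int, then truncate toward zero.
   Returns 0 on success, -1 otherwise. */
static int integer_via_double(const char *s, int *out)
{
    char *end;
    double d;

    assert(s != NULL && out != NULL);
    d = strtod(s, &end);
    if (end == s || *end != '\0') return -1;
    /* The negated test also rejects NaN, for which both comparisons
       are false. */
    if (!(d > -(double)INT_MAX - 1.0 && d < (double)INT_MAX + 1.0))
        return -1;
    *out = (int)d;
    return 0;
}
```

This accepts "1e+05" as 100000, but also silently turns "3.7" into 3,
which is the sort of surprise meant above.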

[As a side point, some locales may accept non-Roman digits.  I think we 
need to exclude those everywhere, not just some places like parsing.]


2) R does not have integer constants.  It would be convenient if it did, 
and I can see no difficulty in allowing the same conversions when parsing 
as when coercing.  This would have the side effect that 100 would be 
integer (although the coercion rules would come into play), whereas 
200000000000000000 would be double.  And x <- 0xce80 would be valid.
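
A minimal sketch of how a lexer might classify a numeric token under
this scheme, assuming C99 (strtoll) and the same in-range rule as for
coercion.  The types const_kind and r_const and the function
classify_constant are hypothetical names, not part of R's parser.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

typedef enum { CONST_INTEGER, CONST_DOUBLE, CONST_INVALID } const_kind;

typedef struct {
    const_kind kind;
    int    i;   /* meaningful when kind == CONST_INTEGER */
    double d;   /* meaningful when kind == CONST_DOUBLE  */
} r_const;

/* Try the token as an in-range integer (decimal, or hex with a 0x
   prefix); if that fails, fall back to double. */
static r_const classify_constant(const char *tok)
{
    r_const r = { CONST_INVALID, 0, 0.0 };
    char *end;
    long long v;
    int base;

    assert(tok != NULL);
    base = (tok[0] == '0' && (tok[1] == 'x' || tok[1] == 'X')) ? 16 : 10;

    errno = 0;
    v = strtoll(tok, &end, base);
    if (end != tok && *end == '\0' && errno == 0 &&
        v >= -INT_MAX && v <= INT_MAX) {
        r.kind = CONST_INTEGER;
        r.i = (int)v;
        return r;
    }

    /* Not an in-range integer: see whether it is a valid double. */
    r.d = strtod(tok, &end);
    if (end != tok && *end == '\0')
        r.kind = CONST_DOUBLE;
    return r;
}
```

So "100" and "0xce80" come out as integer, while "200000000000000000"
and "3.12" come out as double.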


3) We do allow setting LC_NUMERIC, but it partially breaks R if the 
decimal point is not ".".  (I know of no locale in which it is not "." or 
",", and we cannot allow "," as part of numeric constants when parsing.) 
E.g.:

> Sys.setlocale("LC_NUMERIC", "fr_FR")
[1] "fr_FR"
Warning message:
setting 'LC_NUMERIC' may cause R to function strangely in: 
setlocale(category, locale)
> x <- 3.12
> x
[1] 3
> as.numeric("3,12")
[1] 3,12
> as.numeric("3.12")
[1] NA
Warning message:
NAs introduced by coercion

We could do better by insisting that "." was the decimal point in all 
internal conversions _to_ numeric.  Then the effect of setting LC_NUMERIC 
would primarily be on conversions _from_ numeric, especially printing and 
graphical output.  (One issue would be what to do with scan(), which has a 
`dec' argument but is implemented assuming LC_NUMERIC=C.  I would hope to 
continue to have `dec' but perhaps with a locale-dependent default.)  The 
resulting asymmetry (R would not be able to parse its own output) would be 
unhappy, but seems inevitable. (This could be implemented easily by having 
a `dec' arg to EncodeReal and EncodeComplex, and using LC_NUMERIC to 
control that rather than actually setting the locale category.  For 
example, deparsing needs to be done in LC_NUMERIC=C.)
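
The two directions might be sketched in C as follows.  Both function
names (parse_real_dot, encode_real_dec) are hypothetical stand-ins, not
the actual EncodeReal interface, and the code assumes the locale's
decimal mark is a single character.  Input always takes "." as the
decimal point whatever LC_NUMERIC is; output takes the mark as an
explicit argument.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse s as a double, treating '.' as the decimal point regardless of
   the current LC_NUMERIC.  Returns 0 on success, -1 on failure. */
static int parse_real_dot(const char *s, double *out)
{
    char buf[128], *end, *p;
    char ldec = localeconv()->decimal_point[0];
    size_t n;

    assert(s != NULL && out != NULL);
    n = strlen(s);
    if (n == 0 || n >= sizeof buf) return -1;
    /* A locale mark such as ',' must not be accepted on input. */
    if (ldec != '.' && strchr(s, ldec) != NULL) return -1;
    memcpy(buf, s, n + 1);
    /* Hand strtod the mark it expects in the current locale. */
    if ((p = strchr(buf, '.')) != NULL) *p = ldec;
    *out = strtod(buf, &end);
    return (end == buf || *end != '\0') ? -1 : 0;
}

/* Format x with `digits` significant digits, writing `dec` as the
   decimal mark.  Returns 0 on success, -1 if buf is too small. */
static int encode_real_dec(double x, int digits, char dec,
                           char *buf, size_t size)
{
    char ldec = localeconv()->decimal_point[0];
    char *p;
    int n = snprintf(buf, size, "%.*g", digits, x);

    if (n < 0 || (size_t)n >= size) return -1;
    if ((p = strchr(buf, ldec)) != NULL) *p = dec;
    return 0;
}
```

Deparsing would call the encoder with dec = '.', while printing could
pass whatever LC_NUMERIC suggests.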


All of these could be implemented by customized versions of 
strtod/strtol.


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


