[R] R 1.2.1 - read.table - factors problem or is it a data.frame problem

Prof Brian D Ripley ripley at stats.ox.ac.uk
Fri Feb 2 09:11:11 CET 2001


On Fri, 2 Feb 2001, Martin Maechler wrote:

> >>>>> "PD" == Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> writes:
>
>     PD> "Heberto Ghezzo" <Heberto at meakins.lan.mcgill.ca> writes:
>     >> I have some problems with read.table and floats turning up as
>     >> factors. In my case it was not a blank in the file but a unary minus!!
>     >> so 3.24,-57.23,... the 3.24 is numeric but -57.23 is a factor.
>     >> Yes I turned it into a numeric with as.numeric(as.character(.. but I
>     >> think it will be better to modify somehow the read.table/read.csv
>     >> code.
>     >> Thanks anyway.
>
>     PD> That certainly sounds like a bug, but I can't reproduce it:
>
>     PD> $ cat > xx
>     PD> -1,2,3
>     PD> 1,-2,3
>     PD> $ R
>     PD> ...
>     >> summary(read.csv('xx',head=F))
>     PD> V1             V2           V3
>     PD> Min.   :-1.0   Min.   :-2   Min.   :3
>     PD> 1st Qu.:-0.5   1st Qu.:-1   1st Qu.:3
>     PD> Median : 0.0   Median : 0   Median :3
>     PD> Mean   : 0.0   Mean   : 0   Mean   :3
>     PD> 3rd Qu.: 0.5   3rd Qu.: 1   3rd Qu.:3
>     PD> Max.   : 1.0   Max.   : 2   Max.   :3
>
>     PD> Could you give us some further details on the setup that is causing
>     PD> that effect?
>
> Heberto uses a Windoze mailer, hence probably ..
>
> It could be that the problem comes from the fact that some win users
> type non-ASCII minus-like characters (i.e. not the ASCII "minus", but ones
> they find on their keyboards when typing in the data ..):
>
> In iso_8859-1 aka "latin-1" (of which most European MSWin localizations are
>    said to be a superset) there are three kinds of "-" :
>
>        Oct   Dec   Hex   Char   Description
>        --------------------------------------------------------------------
>        055   45    2D     -     Minus  [The standard ASCII one]
>
>        255   173   AD     ­     SOFT HYPHEN
>
>        257   175   AF     ¯     MACRON

Actually, not as far as I can find out (and I have been working on
encodings for the next releases of R). The first really is hyphen in both
latin-1 and WinAnsi (the main Windows char set: the other, WinOEM, is not a
superset of latin-1).  Minus is not in the WinAnsi char set, but it does
have hyphen at 45 and 173 (it has two spaces too).

Unfortunately Adobe's ISOLatin1 encoding for postscript is not the same as
latin-1. That does have minus at 45 and (real) hyphen at 173.

As Windows NT/2000 machines support Unicode, on those the set of
possible inputs is much wider and I don't think R will cope with
Unicode-encoded files.  In Unicode the minus sign is at U+2212 (decimal
8722), and 45 is named hyphen-minus.
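
For anyone wanting to check the code points themselves, here is a minimal
sketch in a current version of R (not R 1.2.1). The vector name `chars` and
its element names are made up for illustration; the characters are written as
\u escapes so the snippet itself stays ASCII.

```r
# Candidate "minus-like" characters, written as escapes so the source is ASCII.
chars <- c(hyphen_minus = "-",
           soft_hyphen  = "\u00ad",
           macron       = "\u00af",
           minus_sign   = "\u2212")

# utf8ToInt() returns the Unicode code point of a single character.
code_points <- vapply(chars, utf8ToInt, integer(1))
print(code_points)

# The same values in the usual U+XXXX hexadecimal notation.
print(sprintf("U+%04X", code_points))
```

Only the first of these is the character that R's number parsing accepts as a
sign.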

It's a possible explanation, but then I don't think
as.numeric(as.character(...)) would work.  My guess was that there was some
other non-printing character in that field, but that has the same
counter-argument.
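
The counter-argument can be illustrated with a small sketch. It is only an
assumption that the stray character is U+2212 (MINUS SIGN); the actual file
was never posted. It also uses features of current R that were not in
R 1.2.1 (the `text=` argument and `stringsAsFactors = TRUE` to mimic the old
factor conversion), and the data values are invented.

```r
# Hypothetical two-line CSV; the first row's second field starts with
# U+2212 (MINUS SIGN) instead of the ASCII "-" (code 45).
lines <- c("3.24,\u{2212}57.23", "1.50,-2.25")

# stringsAsFactors = TRUE mimics the old behaviour of turning strings into factors.
df <- read.csv(text = lines, header = FALSE, stringsAsFactors = TRUE)
print(sapply(df, class))

# The usual factor-to-numeric idiom: the entry with the non-ASCII sign becomes NA.
converted <- suppressWarnings(as.numeric(as.character(df$V2)))
print(converted)

# Flag input lines containing any byte above 127 (i.e. non-ASCII).
has_non_ascii <- vapply(lines,
                        function(s) any(as.integer(charToRaw(s)) > 127L),
                        logical(1), USE.NAMES = FALSE)
print(has_non_ascii)
```

Under that assumption the column is read as a factor, but the conversion
gives NA for the affected entry rather than a usable number, so a non-ASCII
minus alone would not match the original report.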

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



