[R] R 1.2.1 - read.table - factors problem or is it a data.frame problem

Sun Feb 4 23:33:35 CET 2001

Brian Ripley notes:
> On Fri, 2 Feb 2001, Martin Maechler wrote:
> 
> > >>>>> "PD" == Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> writes:
> >
> >     PD> "Heberto Ghezzo" <Heberto at meakins.lan.mcgill.ca> writes:
> >     >> I have some problems with read.table and floats turning up as
> >     >> factors. In my case it was not a blank in the file but a unary
> >     >> minus!! so 3.24,-57.23,... the 3.24 is numeric but -57.23 is a
> >     >> factor. Yes I turned it into a numeric with
> >     >> as.numeric(as.character(.. but I think it will be better to modify
> >     >> somehow the read.table/read.csv code.
> >     >> Thanks anyway.
> >
> >     PD> That certainly sounds like a bug, but I can't reproduce it:
> >
> >     PD> $ cat > xx
> >     PD> -1,2,3
> >     PD> 1,-2,3
> >     PD> $ R
> >     PD> ...
> >     PD> > summary(read.csv('xx',head=F))
> >     PD> V1             V2           V3
> >     PD> Min.   :-1.0   Min.   :-2   Min.   :3
> >     PD> 1st Qu.:-0.5   1st Qu.:-1   1st Qu.:3
> >     PD> Median : 0.0   Median : 0   Median :3
> >     PD> Mean   : 0.0   Mean   : 0   Mean   :3
> >     PD> 3rd Qu.: 0.5   3rd Qu.: 1   3rd Qu.:3
> >     PD> Max.   : 1.0   Max.   : 2   Max.   :3
> >
> >     PD> Could you give us some further details on the setup that is
> >     PD> causing that effect?
> >
> > Heberto uses a Windoze mailer, hence probably ..
> >
> > It could be that the problem comes from the fact that some win users
> > use non-ASCII minus characters (i.e. not the ASCII "minus", but others
> > that they find on their keyboards when typing in the data ..):
> >
> > In iso_8859-1 aka "latin-1" (of which most European MSWin localizations
> > are said to be a superset) there are three kinds of "-" :
> >
> >        Oct   Dec   Hex   Char   Description
> >        --------------------------------------------------------------------
> >        055   45    2D     -     Minus  [The standard ASCII one]
> >        255   173   AD           SOFT HYPHEN
> >        257   175   AF     ¯     MACRON
> 
> Actually, not as far as I can find out (and I have been working on
> encodings for the next releases of R). The first really is hyphen in both
> latin-1 and WinAnsi (the main Windows char set: the other, WinOEM, is not a
> superset of latin-1).  Minus is not in the WinAnsi char set, but it does
> have hyphen at 45 and 173 (it has two spaces too).
> 
> Unfortunately Adobe's ISOLatin1 encoding for postscript is not the same as
> latin-1. That does have minus at 45 and (real) hyphen at 173.
> 
> As Windows NT/2000 machines support Unicode, on those the set of
> possible inputs is much wider and I don't think R will cope with
> Unicode-encoded files.  In Unicode minus is at 138 (and hyphen at 45).
> 
> It's a possible explanation, but then I don't think
> as.numeric(as.character( would work.  My guess was that there was some
> other non-printing character in that field, but that has the same
> counter-argument.
> 
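
The hypothesis discussed above can be checked with a short sketch. It
assumes a much newer R than 1.2.1 (the text= argument of read.csv and
utf8ToInt are later additions) and a UTF-8 capable session. The two data
lines are made up for illustration: the first row has a soft hyphen
(decimal 173) where the ASCII character 45 was intended.

```r
# Hypothetical input: the first row uses a soft hyphen (U+00AD, decimal 173)
# where an ASCII hyphen-minus (decimal 45) was intended.
lines <- c("3.24,\u00ad57.23", "1.50,-2.75")
d <- read.csv(text = lines, header = FALSE, stringsAsFactors = FALSE)

# V1 is numeric; V2 is not, because one field could not be parsed as a number.
str(d)

# Code point of the first character of each field in V2.
first_code <- vapply(d$V2, function(s) utf8ToInt(s)[1], integer(1),
                     USE.NAMES = FALSE)
print(first_code)

# A field containing any non-ASCII character cannot be converted to ASCII,
# so iconv() returns NA for it.
non_ascii <- is.na(iconv(d$V2, from = "UTF-8", to = "ASCII"))
print(non_ascii)

# Direct conversion gives NA for the affected field (a warning is suppressed).
converted <- suppressWarnings(as.numeric(d$V2))
print(converted)
```

If the conversion of the affected field gives NA, as it does in this sketch,
that is consistent with the counter-argument above: a stray non-ASCII
character would also defeat as.numeric(as.character(.
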
I had sought help a few days earlier for a problem with some similarities. In
my case I had failed to recognize the existence of some NA's. I had a data set
which originated in 1966. Some IBM statistical packages of the era encoded NA's
as binary negative zeros. These were propagated in passes through the first
edition of SAS. I can't remember how they were then encoded in EBCDIC by
different FORTRAN compilers, nor ultimately in ASCII conversions. However, they
relied on program filters and were otherwise invisible.
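
As a rough illustration of how such a code can hide in a text file, here is
a minimal sketch in current R. The token "-0" and the column names are
purely hypothetical; as said above, I do not remember the actual encoding.
The only point is that a missing-value code that looks like a number is
read as a number unless it is declared in na.strings.

```r
# Hypothetical file contents: "-0" stands in for the legacy missing-value code.
txt <- "id,score\n1,-0\n2,3\n3,5"

# Default read: "-0" is parsed as an ordinary zero, so the missing value
# is hidden and silently enters any summary.
d_default <- read.csv(text = txt)
print(mean(d_default$score))

# Declaring the code in na.strings turns it into NA on input.
d_na <- read.csv(text = txt, na.strings = c("NA", "-0"))
print(d_na$score)
print(mean(d_na$score, na.rm = TRUE))
```
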

Gordon M. Harrington		Mail:	3720 Village Place, #6308
Professor Emeritus			Waterloo, IA 50702-5848
University of Northern Iowa 	Phone:	319-291-8535
gordon.harrington at uni.edu	Fax:	319-291-8491
dryfly at aya.yale.edu			319-291-8324

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._