R-alpha: NA in data frame not detected

Thomas Lumley thomas@biostat.washington.edu
Fri, 27 Jun 1997 12:21:29 -0700 (PDT)


On Fri, 27 Jun 1997, Friedrich Leisch wrote:

> 
> NA's in a data frame are not handled properly, if the data frame was
> read in using read.table (but I'm not sure if that is the reason of
> the problems):
> 

The problem is with the na.strings argument of read.table.  If you use
this argument, any column containing an NA ends up as a factor.  In your
example the first column is a factor (though you can't tell by looking at
it):
> is.factor(x[,1])
[1] TRUE
> levels(x[,1])
[1] "1"  "2"  "3"  "4"  "6"  "7"  "8"  "NA"
>

The reason is that read.table tries to handle na.strings twice.  The
scan() function gets the na.strings argument and so translates the "?"
into "NA".  The type.convert() function then also gets the na.strings
argument and thinks that NAs are indicated by "?", which is no longer
true.  Because it does not recognise "NA" as missing, it decides that the
data are not numeric and returns a factor.
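
The second pass can be imitated on its own with type.convert().  This is
only a sketch: the vector after_scan is made up and stands in for what
scan() hands over once "?" has been turned into "NA", and the as.is argument
belongs to current R releases, not the 1997 code.

```r
# Hypothetical result of the first pass: "?" has already become the string "NA".
after_scan <- c("1", "2", "NA", "4")

# Second pass told that missing values look like "?":
# "NA" is treated as ordinary text, so the column becomes a factor.
twice <- type.convert(after_scan, na.strings = "?", as.is = FALSE)
is.factor(twice)
levels(twice)

# Second pass with the default na.strings = "NA":
# "NA" is recognised as missing and the column is numeric.
once <- type.convert(after_scan, as.is = FALSE)
is.numeric(once)
```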

The solution seems to be to stop passing na.strings to type.convert(). We
need to keep na.strings in scan() to allow for missing character data.
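
A quick check of the intended behaviour might look like the following.  It
is a sketch that assumes a current R release (the text= and
stringsAsFactors= arguments were not available in 1997), and the two-column
input is invented for illustration: one numeric column and one character
column, each with a "?" marking a missing value.

```r
# Made-up input: "?" marks a missing value in both columns.
x <- read.table(text = "1 a\n? b\n3 ?", na.strings = "?",
                stringsAsFactors = FALSE)

# The first column should stay numeric, with a real NA in row 2.
is.numeric(x[, 1])
is.na(x[2, 1])

# na.strings still has to reach scan() so that missing character
# data are detected as well.
is.na(x[3, 2])
```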

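A patch is at the end of this message.  Note that the diff was made from
the new file to the original, so it has to be applied in reverse
(patch -R).  This may break something else, of course ;-).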

Thomas Lumley
------------------------------------------------------+------
Biostatistics		: "Never attribute to malice what  :
Uni of Washington	:  can be adequately explained by  :
Box 357232		:  incompetence" - Hanlon's Razor  :
Seattle WA 98195-7232	:				   :
------------------------------------------------------------

*** read.table.rnew	Fri Jun 27 11:26:15 1997
--- read.table.orig	Fri Jun 27 11:21:20 1997
***************
*** 56,66 ****
          if (length(as.is) != cols) 
                  stop("as.is has the wrong length")
          for (i in 1:cols) {
!                 if (!as.is[i]) {
! 			 data[[i]]<-type.convert(data[[i]])
! #                        data[[i]] <- type.convert(data[[i]], 
! #                                na.strings = na.strings)
! 			}
          }
          #  now we determine row names
          if (missing(row.names)) {
--- 56,64 ----
          if (length(as.is) != cols) 
                  stop("as.is has the wrong length")
          for (i in 1:cols) {
!                 if (!as.is[i]) 
!                         data[[i]] <- type.convert(data[[i]], 
!                                 na.strings = na.strings)
          }
          #  now we determine row names
          if (missing(row.names)) {
