[R] combining data from different datasets

Barry Rowlingson b.rowlingson at lancaster.ac.uk
Fri Oct 24 20:05:58 CEST 2008


2008/10/24 Gabor Grothendieck <ggrothendieck at gmail.com>:

> NA and "NA" are not the same:
>
>> DF <- data.frame(x = c("a", "NA", NA))
>> DF
>     x
> 1    a
> 2   NA
> 3 <NA>
>>
>> is.na(NA)
> [1] TRUE
>> is.na("NA")
> [1] FALSE

 Yes, but unless you tell it otherwise, read.table will treat NA (the
country code for Namibia) as a missing value, even in a column of
alphabetic strings:

1,US
2,NA
3,UK

> read.table("test.dat",sep=",")
  V1   V2
1  1   US
2  2 <NA>
3  3   UK

 So you think you can use na.strings? The catch is that na.strings
applies to every column, so switching off "NA" for the country column
also switches it off for the numeric column, and a column with real NAs
then gets read as a factor. Here's some data:

$ cat test.dat
1,US
2,NA
3,UK
NA,FR
4,PT

We need column 1 to be integer with an NA, and column 2 to be text
with a real "NA" and not a <NA>:

 Try #1 (NAive effort) reads NA(mibia) as NA(missing), keeps V1 as integers:

> read.table("test.dat",sep=",")
  V1   V2
1  1   US
2  2 <NA>
3  3   UK
4 NA   FR
5  4   PT

 = FAIL

 Try #2 reads NAmibia okay, but reads V1 as factor:

> read.table("test.dat",sep=",",na.strings="")
  V1 V2
1  1 US
2  2 NA
3  3 UK
4 NA FR
5  4 PT

> str(read.table("test.dat",sep=",",na.strings=""))
'data.frame':	5 obs. of  2 variables:
 $ V1: Factor w/ 5 levels "1","2","3","4",..: 1 2 3 5 4
 $ V2: Factor w/ 5 levels "FR","NA","PT",..: 5 2 4 1 3

  = FAIL

 Try #3: let's try colClasses. V1 comes out numeric, but NA(mibia) is
still read as NA(missing):

> read.table("test.dat",sep=",",colClasses=c("numeric","character"))
  V1   V2
1  1   US
2  2 <NA>
3  3   UK
4 NA   FR
5  4   PT

 = FAIL

 Try #4: so let's specify both colClasses and na.strings:

> read.table("test.dat",sep=",",na.strings="",colClasses=c("numeric","character"))
  V1 V2
1  1 US
2  2 NA
3  3 UK
4 NA FR
5  4 PT

 - looks good:

> str(read.table("test.dat",sep=",",na.strings="",colClasses=c("numeric","character")))
'data.frame':	5 obs. of  2 variables:
 $ V1: num  1 2 3 NA 4
 $ V2: chr  "US" "NA" "UK" "FR" ...

 = WIN!
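
 For anyone who wants to rerun the winning call without creating
test.dat by hand, here is a minimal self-contained sketch. The temporary
file and the names f and d are only there for illustration; the
read.table arguments are the same ones used in Try #4.

```r
# Write the five-line sample to a temporary file (the path is illustrative).
f <- tempfile(fileext = ".dat")
writeLines(c("1,US", "2,NA", "3,UK", "NA,FR", "4,PT"), f)

# Same arguments as Try #4: no NA strings, explicit column classes.
d <- read.table(f, sep = ",", na.strings = "",
                colClasses = c("numeric", "character"))

is.na(d$V1)      # which entries of the numeric column are missing
is.na(d$V2)      # which entries of the country column are missing
d$V2[2] == "NA"  # is Namibia still the literal string "NA"?
```

 Checking with is.na() rather than eyeballing the printout avoids
confusing the string NA with a real <NA>.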

 I'm not certain how that works. My guess is that converting column 1
to numeric is what produces the NA, rather than any match against the
na.strings parameter.
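
 One way to poke at that guess is to take na.strings out of the picture
entirely: read both columns as plain text, then convert column 1 by
hand. This is only a sketch of the idea, not a claim about what
read.table does internally; the names f and raw are made up for the
example.

```r
# Converting the string "NA" to numeric gives a missing value on its own.
as.numeric(c("1", "NA", "4"))

# Same sample data as before, written to an illustrative temporary file.
f <- tempfile(fileext = ".dat")
writeLines(c("1,US", "2,NA", "3,UK", "NA,FR", "4,PT"), f)

# Read every column as character so nothing is matched against "NA".
raw <- read.table(f, sep = ",", na.strings = "", colClasses = "character")

# Convert only the column that should be numeric.
raw$V1 <- as.numeric(raw$V1)

str(raw)
```

 If the numeric conversion alone turns "NA" into a missing value here,
that is at least consistent with the guess above.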

Barry


