[R] Summary of Characters vectors, NA's and "" in merges

Prof Brian Ripley ripley at stats.ox.ac.uk
Fri Sep 28 18:51:31 CEST 2001

On Fri, 28 Sep 2001, David Kane wrote:

> Thanks to Brian Ripley, Gregory Warnes, and Dennis Murphy for considering my
> problem about "NA" in character strings. The nub of the issue seems to be that
> you can not have a string with "NA" in it in a character vector in R without it
> being interpreted as meaning NA (i.e., not available). The only work-arounds
> involve renames of various sorts.
> Perhaps this is more appropriate for r-devel, but I was wondering what the
> future holds for character vectors in R, i.e., will this always be a
> problem? Although I am not smart enough to understand the Green Book, there is
> a discussion following page 200 that *seems* to suggest that the usage of a
> string class may make it easier to deal with this issue.
> Is there anything coming down the pike on this point?

Well, we can't change character vectors without invalidating the integrity
of lots of saved objects.  One could use another class, but then you would
need functions to handle that class.  In the case in point that won't
help much as merge.data.frame does an as.character when doing the
matching, and a few other things (see below).

The class string exists in S-PLUS 6 but is almost unused. You can do

> foo <- as(c("NA", "OK"), "string")
> foo
[1] "NA" "OK"
> is.na(foo)
[1] F F
> is.na(foo[2]) <- T
> foo
[1] "NA" <NA>
> is.na(foo)
[1] F T
# but be careful:
> foo[2] <- NA
> foo
[1] "NA" "NA"

Note that you can do this with factors, and I tested it previously on your
example. Start with

> x <- structure(c(1, 2, NA), levels = c("NA", "OK"), class="factor")
> x
[1] NA OK NA
Levels:  NA OK

Here the first is "NA" and the third really is missing.
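
A minimal sketch of how one might tell the two apart programmatically. It
assumes integer codes (1L, 2L) rather than the doubles typed above, since
newer versions of R expect a factor's codes to be integers; the names
missing_code and literal_na are made up for illustration.

```r
# Same factor as above, but built with integer codes (an assumption for newer R).
x <- structure(c(1L, 2L, NA), levels = c("NA", "OK"), class = "factor")

# is.na() looks at the underlying codes, not at the level labels,
# so only a truly missing entry is flagged.
missing_code <- is.na(x)

# Entries whose code is present but whose level is the literal string "NA".
literal_na <- !missing_code & levels(x)[as.integer(x)] == "NA"

print(missing_code)
print(literal_na)
```

The point of the design is that the level labels and the codes are stored
separately, so a label spelled "NA" need not collide with a missing code.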
So in your original example

> a <- data.frame(x = 1:4)
> b <- data.frame(x = 1:3, y = factor(c("NA", "a", "b"), exclude=""))
> m <- merge(a, b, all.x = TRUE)
> m
  x  y
1 1 NA
2 2  a
3 3  b
4 4 NA

you have lost the distinction (look and see) because of this line in
merge.data.frame:
y[(lxy + 1):(lxy + nxx), ] <- NA

and that suggests that [<-.factor is not quite right.  That shows the
subtleties involved: it does not work in S with string classes either.
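
One way to "look and see" is to put the printed label, the underlying code
and the is.na() flag side by side. This is only a sketch: the helper names
codes and check are invented here, and whether rows 1 and 4 come out
different depends on the version of R and of merge.data.frame in use.

```r
a <- data.frame(x = 1:4)
b <- data.frame(x = 1:3, y = factor(c("NA", "a", "b"), exclude = ""))
m <- merge(a, b, all.x = TRUE)

# Integer codes behind each entry of m$y; a missing entry has code NA.
codes <- as.integer(m$y)

# Side-by-side view: label as printed, underlying code, and is.na() flag.
check <- data.frame(x = m$x,
                    label = as.character(m$y),
                    code = codes,
                    missing = is.na(m$y))
print(check)
```

If the distinction has been lost, rows 1 and 4 show the same code and the
same flag; if it has been kept, only row 4 (the unmatched one) is flagged.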

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

