[Rd] Dropping unused levels of a factor that has "NA" as a level

Wed Jul 19 12:17:13 CEST 2006

It is history:

r16144 | ripley | 2001-09-28 19:40:28 +0100 (Fri, 28 Sep 2001) | 2 lines

add is.na<-, distinguish NA level and NA codes in factors

so the code predates having NA character strings distinct from "NA".
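
For readers following along, here is a minimal sketch of the distinction that commit message draws between an NA level and an NA code. It assumes a recent R release, where NA_character_ is distinct from the string "NA"; the variable names are illustrative only, and the results under R 2.3.0 (the version discussed below) may differ.

```r
# Case 1: "NA" is an ordinary string level; the second element has a missing code.
f_code <- factor(c("a", NA), levels = c("a", "NA"))
code_codes <- as.integer(f_code)   # integer codes; the second one is NA
code_miss  <- is.na(f_code)        # is.na() on a factor looks at the codes

# Case 2: NA itself is a level (exclude = NULL), so no code is missing.
f_level <- factor(c("a", NA), exclude = NULL)
level_codes <- as.integer(f_level) # both elements have a valid code
level_miss  <- is.na(f_level)      # FALSE for both: the codes are not missing

# Setting a code to missing via is.na<-, leaving the "NA" level in place.
f_set <- factor(c("a", "NA"), levels = c("a", "NA"))
is.na(f_set)[2] <- TRUE
set_codes <- as.integer(f_set)

# Dropping unused levels, as in the original report.
f_drop <- f_code[, drop = TRUE]
drop_levels <- levels(f_drop)
```

Printing f_code, f_level and f_drop side by side shows how each case is displayed in the R version at hand.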

On Tue, 11 Jul 2006, Brahm, David wrote:

> I mentioned this in R-help on April 28:
> <https://stat.ethz.ch/pipermail/r-help/2006-April/104595.html>
> 
> | as.character.factor contains this line (where cx=levels(x)[x]):
> |   if ("NA" %in% levels(x)) cx[is.na(x)] <- "<NA>"
> |
> | Is it possible that this is no longer the desired behavior?  These
> | two results don't seem very consistent:
> |
> | > as.character(as.factor(c("AB", "CD", NA)))
> | [1] "AB" "CD" NA  
> | > is.na(.Last.value)[3]
> | [1] TRUE
> |
> | > as.character(as.factor(c("NA", "CD", NA)))
> | [1] "NA"   "CD"   "<NA>"
> | > is.na(.Last.value)[3]
> | [1] FALSE
> |
> | I'm using R-2.3.0 on Redhat Linux, but I don't think the behavior
> | is new (maybe since character NA's were introduced?).
> |
> | -- David Brahm (brahm at alum.mit.edu)
> 
> 
> -----Original Message-----
> From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf Of Peter Dalgaard
> Sent: Tuesday, July 11, 2006 5:59 PM
> To: J. Hosking
> Cc: r-devel at stat.math.ethz.ch
> Subject: Re: [Rd] Dropping unused levels of a factor that has "NA" as a level
> 
> "J. Hosking" <jh910 at juno.com> writes:
> 
> > Is this a bug?
> > 
> >    > f1 <- factor(c("a", NA), levels = c("a", "NA") )
> >    > f2 <- f1[, drop = TRUE]
> >    > f2
> >    [1] a    <NA>
> >    Levels: a <NA>
> > 
> > I would have expected f2 to have only one level, "a".  It seems
> > to me that the code in [.factor does not follow the advice in
> > help("factor") on how to set factor codes to be missing when
> > "NA" is a level of the factor.
> 
> 
> Something odd is going on, that's for sure...
> 
> The problem is also there with factor(f1). And the logic in
> as.character.factor seems to be at the root of it:
> 
> > as.character.factor
> function (x, ...)
> {
>     cx <- levels(x)[x]
>     if ("NA" %in% levels(x))
>         cx[is.na(x)] <- "<NA>"
>     cx
> }
>  
> This looks like something from before we had character NA values. I
> wonder if it is a mistake or there could actually be a reason to
> keep it. 
> 
> 

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595