[Rd] Bug in tapply with factors containing NAs (PR#6672)

Peter Dalgaard p.dalgaard at biostat.ku.dk
Mon Mar 15 12:20:22 MET 2004


george.leigh at dpi.qld.gov.au writes:

> Full_Name: George Leigh
> Version: 1.8.1
> OS: Windows 2000
> Submission from: (NULL) (203.25.1.208)
> 
> 
> The following example gives the correct answer when the first argument of tapply
> is a numeric vector, but an incorrect answer when it is a factor.  If the
> function used by tapply is "length", the type and contents of the first argument
> should make no difference, provided it has the same length as the second
> argument.
> 
> > x = c(NA, 1)
> > y = factor(x)
> > tapply(x, y, length)
> 1 
> 1 
> > tapply(y, y, length)
> 1 
> 2 
> >

The core of this is that

> split(y,y)
$"1"
[1] <NA> 1
Levels: 1

> split(x,y)
$"1"
[1] 1


which in turn comes from the innards of split.default:

...
    if (is.null(attr(x, "class")) && is.null(names(x)))
        return(.Internal(split(x, f)))
    lf <- levels(f)
    y <- vector("list", length(lf))
    names(y) <- lf
    for (k in lf) y[[k]] <- x[f == k]
    y

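Factors have a class attribute, so the internal code is not used in
that case. The comparison f == k is NA wherever f is NA, and an NA in
a logical subscript yields an NA element instead of dropping it: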

> y[y=="1"]
[1] <NA> 1
Levels: 1 
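
For anyone wanting to see the mechanism in isolation, here is a minimal
sketch using the x and y from the report. The name idx is only introduced
here for illustration; nothing else is assumed beyond base R.

```r
x <- c(NA, 1)
y <- factor(x)

# Comparing a factor that contains NA propagates the NA into the result.
idx <- y == "1"
print(idx)

# A logical NA used as a subscript selects an NA element rather than
# dropping it, so both the vector and the factor keep two elements.
print(x[idx])
print(length(y[idx]))

# which() discards NA, so only the genuine match survives.
print(x[which(idx)])
```

This is the same effect that the loop in split.default runs into when x
has a class attribute.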

I think the loop line in split.default needs to read

    for (k in lf) y[[k]] <- x[!is.na(f) & f == k]
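
A rough way to try the guarded subscript without touching split.default
is a small stand-alone function. The name split_na_safe is hypothetical,
and the body is only a sketch that copies the non-internal branch quoted
earlier in this message, not the code as it would appear in R itself.

```r
# Hypothetical helper: the non-internal loop from split.default,
# with the NA guard added to the subscript.
split_na_safe <- function(x, f) {
    lf <- levels(f)
    out <- vector("list", length(lf))
    names(out) <- lf
    # !is.na(f) turns the NA comparisons into FALSE, so those
    # elements are dropped instead of appearing as NA.
    for (k in lf) out[[k]] <- x[!is.na(f) & f == k]
    out
}

x <- c(NA, 1)
y <- factor(x)

res <- split_na_safe(y, y)
print(res)

# Group sizes now agree whether the first argument is x or y.
print(sapply(res, length))
print(sapply(split_na_safe(x, y), length))
```

With this guard, the group for level "1" has length 1 in both cases,
which is what the original report expected from tapply.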

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907


