[R] tapply() and using factor() on a factor

Fri Oct 16 17:33:07 CEST 2009

Thank you Mohamed and Bill for your replies.  (I did not send the data
because it is unwieldy.)

Yes Bill, the issue arises directly from what you had guessed.  I was
working with a subset of the data, and its factor still carried the
levels of the complete data set.

On this, what is the best way to take a subset of the data which drops
these "extraneous" levels?

> log <- data.frame(Flag=1:2,
+     RequestID=factor(letters[1:2], levels=letters[1:10]))
> log2 <- subset(log, RequestID=="a")

> levels(log2$RequestID)
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

In other words, how do I take a subset which yields "a" as the only
level for log2?

Alex
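
For reference, a minimal sketch of two ways this is commonly handled,
using the toy data from above.  The name log3 is introduced here only
for illustration, and droplevels() assumes R 2.12.0 or later, which
postdates this thread.

```r
# Toy data from the message above: two rows, ten declared levels.
log <- data.frame(Flag = 1:2,
                  RequestID = factor(letters[1:2], levels = letters[1:10]))
log2 <- subset(log, RequestID == "a")

# Option 1: re-create the factor so only the levels present are kept.
log2$RequestID <- factor(log2$RequestID)
levels(log2$RequestID)
# [1] "a"

# Option 2 (assumes R >= 2.12.0): droplevels() drops unused levels
# from every factor column of the data frame in one call.
log3 <- droplevels(subset(log, RequestID == "a"))
levels(log3$RequestID)
# [1] "a"
```

Option 1 touches a single column; Option 2 is more convenient when a
data frame has several factor columns.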

-----Original Message-----
From: William Dunlap [mailto:wdunlap at tibco.com] 
Sent: Thursday, October 15, 2009 11:59 PM
To: Alexander Peterhansl; r-help at r-project.org
Subject: RE: [R] tapply() and using factor() on a factor

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Alexander 
> Peterhansl
> Sent: Thursday, October 15, 2009 2:50 PM
> To: r-help at r-project.org
> Subject: [R] tapply() and using factor() on a factor
> 
> Dear List,
> 
> Shouldn't result1 and result2 be equal in the following case?
> 
> Note that log$RequestID is a factor.  That is, 
> is.factor(log$RequestID) yields TRUE.
> 
> result1 <- tapply(log$Flag,factor(log$RequestID),sum)
> result2 <- tapply(log$Flag,log$RequestID,sum)

Showing us the output of dput(log) (or str(log) and summary(log))
would let people discover the problem more readily.  Since you
didn't, I'll guess what the dataset may contain.

If log$RequestID is a factor with lots of unused levels, tapply
will output an NA for each unused level.  factor(log$RequestID)
will create a new set of levels, only those actually used,
so tapply will not be forced to fill those spots with NAs.  E.g.,

> log <- data.frame(Flag=1:2,
+     RequestID=factor(letters[1:2], levels=letters[1:10]))
> tapply(log$Flag, log$RequestID, sum)
 a  b  c  d  e  f  g  h  i  j
 1  2 NA NA NA NA NA NA NA NA
> tapply(log$Flag, factor(log$RequestID), sum)
a b
1 2
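
As a rough check of the explanation above, the number of NA cells can be
compared with the number of unused levels.  This is only a sketch on the
same toy data; the names res and unused are made up for illustration.

```r
# Same toy data as in the example above.
log <- data.frame(Flag = 1:2,
                  RequestID = factor(letters[1:2], levels = letters[1:10]))

res <- tapply(log$Flag, log$RequestID, sum)

# Levels that are declared but never occur in the data.
unused <- setdiff(levels(log$RequestID), as.character(log$RequestID))

# Each unused level should account for exactly one NA cell.
sum(is.na(res)) == length(unused)
# [1] TRUE
```

Applied to the real data, the same comparison would be expected to
account for the NA's reported by summary().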

I suppose tapply(X,INDEX,FUN) could call FUN(X[0]) to see
how to fill the cells with no data behind them, but it doesn't.
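
The idea could be sketched roughly as below.  tapply_fill is a
hypothetical helper, not part of base R, and it assumes a single
factor as INDEX.

```r
# Hypothetical helper (not in base R): fill cells that have no data
# behind them with FUN applied to a zero-length slice of X.
# Assumes INDEX is a single factor.
tapply_fill <- function(X, INDEX, FUN, ...) {
  res <- tapply(X, INDEX, FUN, ...)
  # table() counts observations per level, in level order.
  empty <- as.vector(table(INDEX)) == 0
  res[empty] <- FUN(X[0], ...)
  res
}

log <- data.frame(Flag = 1:2,
                  RequestID = factor(letters[1:2], levels = letters[1:10]))
filled <- tapply_fill(log$Flag, log$RequestID, sum)
filled
#  a  b  c  d  e  f  g  h  i  j
#  1  2  0  0  0  0  0  0  0  0
```

This is sensible for sum, where the value on empty input is 0, but
less so for functions such as max, which warn on zero-length input.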

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> 
> Yet, when I summarize the output, I get the following:
> 
> summary(result1)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
>   11.00   11.00   11.00   26.06   11.00  101.00
> 
> summary(result2)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
>   11.00   11.00   11.00   26.06   11.00  101.00  978.00
> 
> Why does result2 have 978 NA's?
> 
> Any help on this would be appreciated.
> 
> Alex
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>