[R] Creating subsets with factors

Frank E Harrell Jr fharrell at virginia.edu
Wed Jan 9 13:53:41 CET 2002


I respectfully disagree with Peter.  In all the data analysis I have done, I have found that 99% of the time it is most convenient to have unused levels dropped upon subsetting, so in my work I do this by default.  I realize that system overrides are to be avoided at almost all costs, but [.factor is the only function I override for R.  I print a message saying that the traditional behavior may be obtained by using options(drop.unused.levels=FALSE).  I have lobbied S-Plus and R developers to adopt this approach, albeit with drop.unused.levels=FALSE as the DEFAULT (i.e., users would say options(drop.unused.levels=TRUE) to get my behavior), but an insufficient number of people seem to agree with me on this point.
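
To make the idea of an option-controlled default concrete, here is a minimal sketch. It is not the actual [.factor override; subset_factor is a hypothetical helper invented for illustration, and the only name taken from the text is the option drop.unused.levels, defaulted here to FALSE as in the lobbying proposal.

```r
# Hypothetical helper (an assumption for illustration, not the real override):
# subsets a factor and drops unused levels only when the option asks for it.
subset_factor <- function(x, i) {
  # Option name follows the text; FALSE is the assumed default
  drop <- isTRUE(getOption("drop.unused.levels", FALSE))
  x[i, drop = drop]
}

f <- factor(c("a", "b", "c"))

options(drop.unused.levels = FALSE)
kept <- subset_factor(f, f != "a")      # traditional behavior: "a" stays a level

options(drop.unused.levels = TRUE)
dropped <- subset_factor(f, f != "a")   # unused level "a" is removed

options(drop.unused.levels = NULL)      # unset the option again
```

A wrapper like this leaves the system function untouched, at the price of having to call it explicitly.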

A slightly more logical way to drop unused levels for the current setup is 

  x <- x[, drop=TRUE]
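
As a small illustration, the sketch below applies this idiom to a toy factor and compares it with Peter's factor() suggestion. The level names come from the original question; the data values themselves are made up.

```r
# Toy factor using the level names from the question (values are assumed)
vp <- factor(c("p1", "p1ab", "p1ab", "p1br", "p1kf", "p1mg", "p1mw"))

sub <- vp[vp != "p1"]             # default subsetting keeps the unused level "p1"
n_before <- nlevels(sub)

dropped_a <- sub[, drop = TRUE]   # the idiom shown above
dropped_b <- factor(sub)          # re-creating the factor, as Peter suggests
n_after <- nlevels(dropped_a)
```

Both forms should leave the data values unchanged and differ from sub only in the level set.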

I have not needed to use c(f1, f2), but it seems to me that Peter's example points more to a deficiency in c, or to the need for another binding function for this case (which can be done with factor(c(as.character(f1),as.character(f2))), depending on how NAs are handled).
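
A minimal sketch of that workaround follows, using two made-up factors f1 and f2 with different level sets. The exclude = NULL variant is one possible way to treat NAs, shown as an assumption about what "handling NAs" might mean here.

```r
# Two illustrative factors with different level sets (made-up data)
f1 <- factor(c("a", "b"))
f2 <- factor(c("b", "c", NA))

# Bind via character vectors; levels become the union of observed values
combined <- factor(c(as.character(f1), as.character(f2)))

# Variant that keeps NA as an explicit level instead of a missing value
combined_na <- factor(c(as.character(f1), as.character(f2)), exclude = NULL)
```

In the first form the NA stays a missing value; in the second it becomes a fourth level.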

Frank Harrell

On 09 Jan 2002 11:07:52 +0100
Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> wrote:

> Sven Garbade <garbade at psy.uni-muenchen.de> writes:
> 
> > Hi all,
> > 
> > I don't understand the following output. I've created a data subset from
> > a data frame by
> > 
> > > p1.sub <- subset(p1.dat, vp!="p1")
> > 
> > this is ok. But 
> > 
> > > attach(p1.sub)
> > > vp
> >  [1] p1ab p1ab p1ab p1ab p1ab p1br p1br p1br p1br p1br p1kf p1kf p1kf p1kf p1kf
> > [16] p1mg p1mg p1mg p1mg p1mg p1mw p1mw p1mw p1mw p1mw
> > Levels:  p1 p1ab p1br p1kf p1mg p1mw 
> > 
> > shows me that the factor vp has 6 levels instead of 5? 5 should be the
> > correct number of levels, because p1 isn't in the data subset.
> 
> Nope. Factors can have levels that are not present in the data set.
> There are good reasons for this. For instance you cannot c(f1,f2) if
> f1 and f2 are factors with different level sets. 
> 
> If you want to reduce the levels to those present in the factor, use  
> 
> p1.sub$vp <- factor(p1.sub$vp)
> 
> -- 
>    O__  ---- Peter Dalgaard             Blegdamsvej 3  
>   c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
>  (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._


-- 
Frank E Harrell Jr              Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine  http://hesweb1.med.virginia.edu/biostat