[Rd] adding a built-in drop.levels option for subset() in 2.12 ?

Peter Dalgaard pdalgd at gmail.com
Mon Aug 16 01:23:26 CEST 2010


Ben Bolker wrote:
>   With the approach of R 2.12.0:
> 
>   with mild apologies for re-opening this perennial issue:
> is there any hope, if appropriate patches are submitted, of adding a
> drop.levels argument (with default equal to FALSE to preserve backward
> compatibility/efficiency) to the subset function ... ?
>   If not, would a patch to the documentation and/or the R FAQ be accepted?

I don't think it is desirable (I probably said so before).

As far as I'm concerned, factors should NOT change their level set from
subsetting, and if you want them to lose unused levels, f <- factor(f)
gets you there soon enough.
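
A minimal sketch of that idiom, on a made-up factor (the names f, g_kept
and g_dropped are only for illustration):

```r
# A small factor with three levels (hypothetical data)
f <- factor(c("a", "b", "c", "a"))

# Subsetting keeps the full level set, even though "c" no longer occurs
g_kept <- f[f != "c"]
levels(g_kept)

# Re-applying factor() rebuilds the level set from the values present
g_dropped <- factor(g_kept)
levels(g_dropped)
```

The first levels() call should still list all three levels, the second
only the two that remain in the data.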

> 
>   This does seem to be a continuing source of confusion/frustration (it
> certainly is among my students, and here is some documentation from
> r-help over the years). 

Well, if you don't give students tasks where it is important to preserve
the level set, then I can believe that they might be frustrated that
empty levels are retained. However, if you have a data set with 50-odd
responses of (say) good-medium-poor, they would get equally frustrated
by having to reinstate the three factor levels after subsetting.
(Perhaps you need to have been exposed to SAS PROC FREQ's notorious
inability to generate zero-counts, or SPSS barplots labeled 4-6-7-9-10,
to see the point.)
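
To illustrate the zero-count point, here is a small sketch with invented
ratings (rating, group and tab_A are hypothetical names, not from any real
data set):

```r
# Hypothetical ratings; the full scale is good / medium / poor
rating <- factor(c("good", "poor", "good", "medium", "poor"),
                 levels = c("good", "medium", "poor"))
group  <- c("A", "A", "B", "B", "B")

# Group A has no "medium" responses, but the level is retained,
# so table() still reports it, with a count of zero
tab_A <- table(rating[group == "A"])
tab_A
```

Had the unused level been dropped on subsetting, the "medium" column
would silently vanish from the table.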

As far as I can see, the confusion mainly arises when the factor itself
is used in subsetting: "I selected sex=='M', but 'F' is still listed as
a level".  One has to ask whether the same reaction would have been
triggered from (say) selecting everyone over 6 foot 2, which just
happened to be an all-male population. I suggest that this would be
taken as completely uncontroversial:

> library(ISwR)  # juul2 ships with the ISwR package
> data(juul2)
> juul2 <- transform(juul2, sex=factor(sex,labels=c("M","F")))
> with(subset(juul2, height > 187), table(sex))
sex
 M  F
23  0


If the selection is explicitly on sex, it may _feel_ like a
contradiction if the other sex is "still present", but to R, a subset is
a subset, and R cannot reasonably treat the two cases differently.
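
A self-contained sketch of the two cases side by side, using a tiny
invented data frame rather than juul2 (d, by_height and by_sex are
hypothetical names):

```r
# Hypothetical data frame standing in for a real data set
d <- data.frame(
  sex    = factor(c("M", "F", "M", "F", "M"), levels = c("M", "F")),
  height = c(190, 165, 188, 170, 175)
)

by_height <- subset(d, height > 187)  # happens to select only males
by_sex    <- subset(d, sex == "M")    # selects males explicitly

# Both subsets carry the same level set for sex
levels(by_height$sex)
levels(by_sex$sex)
```

Either way the result is a subset of the rows, and the factor keeps both
levels.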


> Note that some of the earliest threads here
> refer to the problem (now fixed) that the subset() documentation failed
> to note that the existing 'drop' argument would *not* (confusingly) drop
> unused levels.

(Was that actually misdocumented at the time? Otherwise, I honestly
don't know where that confusion came from: drop=TRUE drops dimensions
of extent one, just as it does in matrix and data frame indexing.)
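
A short sketch of what 'drop' does and does not do, on made-up objects
(m, d and one_col are hypothetical names):

```r
# Matrix indexing: drop controls whether a dimension of extent one survives
m <- matrix(1:6, nrow = 2)
dim(m[1, , drop = FALSE])  # still a 1 x 3 matrix
dim(m[1, ])                # NULL: the result is a plain vector

# subset() on a data frame passes drop on to the indexing step
d <- data.frame(x = 1:3, f = factor(c("a", "b", "a")))
one_col <- subset(d, x < 2, select = f, drop = TRUE)

is.factor(one_col)         # a bare factor, no longer a data frame
levels(one_col)            # the unused level "b" is still there
```

So drop=TRUE changes the shape of the result, not the level set of any
factor inside it.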

> 
> http://finzi.psych.upenn.edu/Rhelp10/2008-April/158566.html
> http://finzi.psych.upenn.edu/R/Rhelp02/archive/42976.html
> http://finzi.psych.upenn.edu/R/Rhelp02/archive/36961.html
> http://finzi.psych.upenn.edu/Rhelp10/2009-November/217878.html
> http://article.gmane.org/gmane.comp.lang.r.general/200395
> 
>   This suggestion is milder and less wide-ranging than a global
> drop.unused.levels option, or than convincing everyone to use strings
> rather than factors most of the time ...
> 
>   cheers
>     Ben Bolker
> 
> 


-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


