[Rd] (PR#7976) split() dropping levels (was "boxplot by factor")

Fri Jul 1 18:36:54 CEST 2005

>>>>> "PD" == Peter Dalgaard <p.dalgaard at biostat.ku.dk>
>>>>>     on 28 Jun 2005 14:57:42 +0200 writes:

    PD> "Liaw, Andy" <andy_liaw at merck.com> writes:
    >> The issue is not with boxplot, but with split.  boxplot.formula() 
    >> calls boxplot(split(mf[[response]], mf[-response]), ...), 
    >> but look at what split() returns when there are empty levels in
    >> the factor:
    >> 
    >> > f <- factor(gl(3, 6), levels=1:5)
    >> > y <- rnorm(f)
    >> > split(y, f)
    >> $"1"
    >> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520
    >> 
    >> $"2"
    >> [1] -1.1296642 -0.4808355 -0.2789933  0.1220718  0.1287742 -0.7573801
    >> 
    >> $"3"
    >> [1]  1.2320902  0.5090700 -1.5508074  2.1373780  1.1681297 -0.7151561
    >> 
    >> The "culprit" is the following in split.default():
    >> 
    >> f <- factor(f)
    >> 
    >> which drops empty levels in f, if there are any.  BTW, ?split doesn't
    >> mention what it does in such a situation.  Perhaps it should?
    >> 
    >> If this is to be "fixed", I suppose an additional argument, e.g.,
    >> drop=TRUE, can be added, and the corresponding line mentioned
    >> above changed to something like:
    >> 
    >> if (drop || !is.factor(f)) f <- factor(f)
    >> 
    >> Then this additional argument can be passed on from boxplot.formula() to 
    >> split().

    PD> Alternatively, I suspect that the intention was as.factor() rather
    PD> than factor(). 

At first I thought Peter was right, but the real source of
split.default() contains a comment (!), and the line in question is

    f <- factor(f) # drop extraneous levels

so it seems this was done very much on purpose.
OTOH, S(-plus) has implemented it quite a bit differently, and actually
does keep the empty levels in the example

  f <- factor(rep(1:3, each=6), levels=1:5); y <- rnorm(f); split(y, f)
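
To make the factor() vs. as.factor() distinction concrete, here is a
minimal sketch using the same f as above. It only illustrates how the two
base functions treat unused levels and is not taken from the split.default()
source:

```r
f <- factor(rep(1:3, each = 6), levels = 1:5)

nlevels(f)             # 5: levels "4" and "5" are declared but unused
nlevels(factor(f))     # 3: factor() rebuilds the levels from the values present
nlevels(as.factor(f))  # 5: as.factor() returns an existing factor unchanged
```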

    PD> It does require a bit of care to fix it that way,
    PD> though. There could be problems with empty levels popping up in
    PD> unexpected places. 

Indeed!
Given the new facts, I think we want to go in Andy's direction
with a new argument, 'drop'
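
A rough sketch of what Andy's suggestion amounts to. The function name
split_sketch() is hypothetical and the body is a simplified stand-in, not
the actual split.default() code; only the conditional on 'drop' is taken
from his proposal, with his suggested drop=TRUE as the default:

```r
# Hypothetical stand-in for illustration; not the real split.default().
split_sketch <- function(x, f, drop = TRUE) {
    # Andy's proposed line: only re-create the factor (and thereby drop
    # unused levels) when asked to, or when f is not yet a factor.
    if (drop || !is.factor(f)) f <- factor(f)
    # One list component per level; an unused level yields a zero-length vector.
    out <- lapply(levels(f), function(lev) x[which(f == lev)])
    names(out) <- levels(f)
    out
}

f <- factor(rep(1:3, each = 6), levels = 1:5)
y <- rnorm(length(f))
length(split_sketch(y, f))                # 3: empty levels dropped
length(split_sketch(y, f, drop = FALSE))  # 5: empty levels kept
```

With drop = FALSE the components "4" and "5" are present but empty, which
is what boxplot() would need in order to show all five groups.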

As Peter mentioned, the real question is about its default.
"drop = TRUE"   would be fully compatible with previous versions of R.
"drop = FALSE"  would be compatible with S and S-plus.

I'm going to implement it, and try to see if 'drop = FALSE'
gives changes for R and its standard packages;  if 'yes', that
would be an indication that such an R back-compatibility-breaking
change is not a good idea.  If 'no', I could commit it and see
if it has an effect on the CRAN packages....

Of course, since split() and `split<-` are S3 generics, and
since there's also unsplit(),  this entails a whole slew of
changes {adding a "drop = FALSE" argument everywhere!}
and I presume will break the code of everyone who has written
their own split.foobar methods....
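
A small sketch of the kind of breakage meant here. The generic mysplit()
and the class "foobar" are made up for illustration, so that the real
split() is left untouched; the assumption is an existing method written
without a 'drop' argument and without '...':

```r
# Hypothetical generic and method, for illustration only.
mysplit <- function(x, f, drop = FALSE, ...) UseMethod("mysplit")
mysplit.foobar <- function(x, f) "old-style method"  # no 'drop', no '...'

obj <- structure(list(), class = "foobar")

mysplit(obj, 1)  # still works: 'drop' is not supplied, so it is not passed on

# As soon as a caller supplies 'drop', dispatch hands it to the method,
# which has no matching formal argument and fails.
res <- try(mysplit(obj, 1, drop = FALSE), silent = TRUE)
inherits(res, "try-error")  # TRUE
```

So old methods would keep working for old calls, but would fail for any
caller (such as an updated boxplot.formula()) that passes 'drop' explicitly.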

great...

Martin