[Rd] (PR#7976) split() dropping levels (was "boxplot by factor")

Mon Jul 4 09:15:59 CEST 2005

[ Hmm, is everyone interested in changes inside R "sleeping",
  uninterested, ...
]

>>>>> "MM" == Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Fri, 1 Jul 2005 18:36:54 +0200 writes:

>>>>> "PD" == Peter Dalgaard <p.dalgaard at biostat.ku.dk>
>>>>>     on 28 Jun 2005 14:57:42 +0200 writes:

    PD> "Liaw, Andy" <andy_liaw at merck.com> writes:
    >>> The issue is not with boxplot, but with split.  boxplot.formula() 
    >>> calls boxplot(split(mf[[response]], mf[-response]), ...),
    >>> but look at what split() returns when there are empty levels in
    >>> the factor:
    >>> 
    >>> > f <- factor(gl(3, 6), levels=1:5)
    >>> > y <- rnorm(f)
    >>> > split(y, f)
    >>> $"1"
    >>> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520
    >>> 
    >>> $"2"
    >>> [1] -1.1296642 -0.4808355 -0.2789933  0.1220718  0.1287742 -0.7573801
    >>> 
    >>> $"3"
    >>> [1]  1.2320902  0.5090700 -1.5508074  2.1373780  1.1681297 -0.7151561
    >>> 
    >>> The "culprit" is the following in split.default():
    >>> 
    >>> f <- factor(f)
    >>> 
    >>> which drops empty levels in f, if there are any.  BTW, ?split doesn't
    >>> mention what it does in such a situation.  Perhaps it should?
    >>> 
    >>> If this is to be "fixed", I suppose an additional argument, e.g.,
    >>> drop=TRUE, can be added, and the corresponding line mentioned
    >>> above changed to something like:
    >>> 
    >>> if (drop || !is.factor(f)) f <- factor(f)
    >>> 
    >>> Then this additional argument can be passed on from boxplot.formula() to
    >>> split().

    PD> Alternatively, I suspect that the intention was as.factor() rather
    PD> than factor(). 

    MM> at first I thought Peter was right; but the real source of
    MM> split.default contains a comment (!) and that line is

    MM> f <- factor(f) # drop extraneous levels

    MM> so it seems, this was done there very much on purpose.    
    MM> OTOH, S(-plus) has implemented it quite a bit differently, and actually
    MM> does keep the empty levels in the example

    MM> f <- factor(rep(1:3, each=6), levels=1:5); y <- rnorm(f); split(y, f)

    PD> It does require a bit of care to fix it that way,
    PD> though. There could be problems with empty levels popping up in
    PD> unexpected places. 

    MM> Indeed!
    MM> Given the new facts, I think we want to go in Andy's direction
    MM> with a new argument, 'drop'

    MM> As Peter mentioned, the real question is about its default.
    MM> "drop = TRUE"   would be fully compatible with previous versions of R.
    MM> "drop = FALSE"  would be compatible with S and S-plus.

    MM> I'm going to implement it, and try to see if 'drop = FALSE'
    MM> gives changes for R and its standard packages;  if 'yes', that
    MM> would be an indication that such an R-back-compatibility-breaking
    MM> change was not a good idea.  If 'no', I could commit it and see
    MM> if it has an effect on the CRAN packages....

    MM> Of course, since split() and split()<- are S3 generics, and
    MM> since there's also unsplit(),  this entails a whole slew of
    MM> changes {adding a "drop = FALSE" argument everywhere!}
    MM> and I presume will break the code of everyone who has written their own
    MM> split.foobar methods....

    MM> great...

    MM> Martin

The change doesn't seem to affect the "standard" packages at all,
which is good.  On CRAN, there seem to be only two packages that
have  split() or split()<-  methods,  namely 'spatstat' and 'compositions'.

If we introduced the extra argument 'drop',
these packages and all other user code defining split methods would
have to be updated to be compatible with the changed (S3)
generic having an extra argument 'drop'.
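
To make the compatibility concern concrete, here is a minimal sketch.
Everything in it is an assumption for illustration only: 'mysplit' is a
hypothetical stand-in for the split() generic (not the real R source), and
'foobar' is a made-up class. It shows a method written before the change,
without a 'drop' formal, silently swallowing the new argument in '...':

```r
# Hypothetical generic that has gained a 'drop' argument
# ('mysplit' is a stand-in for split(), not the real implementation).
mysplit <- function(x, f, drop = FALSE, ...) UseMethod("mysplit")

mysplit.default <- function(x, f, drop = FALSE, ...) {
    # Andy's proposed line: only re-factor when dropping is requested
    if (drop || !is.factor(f)) f <- factor(f)
    # one list element per level, empty levels included
    sapply(levels(f), function(l) x[f == l], simplify = FALSE)
}

# A user method written before the change: it has no 'drop' formal,
# so 'drop' lands in '...' and is never looked at.
mysplit.foobar <- function(x, f, ...) {
    mysplit.default(unclass(x), f)
}

obj <- structure(1:4, class = "foobar")
g   <- factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))

old_method  <- names(mysplit(obj, g, drop = TRUE))          # "a" "b" "c"
new_default <- names(mysplit(unclass(obj), g, drop = TRUE)) # "a" "b"
```

The old-style method does not fail, it just ignores 'drop'; as far as I
know, R CMD check would additionally flag the mismatch between the formals
of the generic and those of the method.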

With this in mind, after more thought, I think that Peter's
initial proposal ---just replacing 'factor()' by 'as.factor()'
inside split--- is nicer than introducing 'drop' and
*changing* the default behavior to  'drop = FALSE',  for the
following reasons:

1) people who rely on the current behavior would have to change
   their calls to split() anyway;

2) instead of calling  
       split(x, f, drop=TRUE)
   they can as well go for
       split(x, factor(f)) 
   which has the identical effect but does not introduce an extra
   argument 'drop'.

3) keeping the empty levels, as S and S-plus do, gives slightly
   higher compatibility with S.
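
The difference behind 1) and 2) can be checked directly with the example
factor from earlier in this thread. This is only a sketch: 'y' is a
deterministic vector that I substitute for rnorm(f) so that the result is
reproducible.

```r
# Factor from the thread: levels 4 and 5 are empty
f <- factor(rep(1:3, each = 6), levels = 1:5)
y <- seq_along(f)        # assumption: deterministic stand-in for rnorm(f)

lev_factor    <- levels(factor(f))     # "1" "2" "3": unused levels dropped
lev_as_factor <- levels(as.factor(f))  # "1" "2" "3" "4" "5": f is unchanged

# Dropping made explicit by the caller, whatever split() does internally
s <- split(y, factor(f))
names(s)                               # "1" "2" "3"
```

So code that relies on the empty levels being dropped can say so at the
call site, without any change to the argument list of split().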

---

I intend to change this in R-devel
{with appropriate notes in NEWS !} during this week, unless
someone finds good reasons for a different (or no) change.

Martin