[Rd] c.factor

Wed Nov 15 13:51:17 CET 2006

Prof Ripley,

> Well, R has managed without a factor method for c() for most of its
> decade of existence (not that it originally had factors as we know them).

R has managed without other things for most of its decade, too. For
example, row names in data frames have only very recently been made
efficient: R managed without that for a decade, yet the improvement was
still worth making. When we become aware of something we believe is
missing in R, I believe the correct approach, the approach you advocate,
is to contribute back to the list. This is what I did. I also
contributed a potential solution in the form of working source code.

I stand by my statement that the current result of c(x,y) when x and y
are factors is not useful. It is a specific statement about a specific
operation, not a general criticism of R. I agree with you that factors
are best viewed as an enumeration type, but I would argue further that
c() of two enumerated types should return an enumerated type, retaining
that powerful feature of R. Currently, however, c() ignores the fact
that x and y are enumerated: it silently drops the levels information
and returns the concatenated integer codes, which refer to two different
level sets and so are, well, not useful. Or, if you prefer, not as
useful as the proposal I posted.

I have a solution which works for me, and I have contributed it. One
other person has shown some interest and has extended it to work with
multiple arguments, which looks like a nice improvement.

My only comment, if c.factor does go further, is to please avoid
as.character in the implementation. One key advantage of the factor type
is precisely that it is enumerated, and therefore efficient for
categorical data sets. Intermediate coercion to character is inefficient
in this case, which is why I avoided it in the solution I posted.
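
As a rough sketch of how the multiple-argument version could avoid
as.character, the integer codes of each argument can be remapped onto
the union of the level sets. The name combine_factors is hypothetical,
chosen only for illustration, and this is an untuned sketch rather than
a proposal for the final implementation:

```r
# Hypothetical helper (not an existing R function): combine any number
# of factors by remapping integer codes, never coercing data to character.
combine_factors <- function(...) {
  args <- list(...)
  if (!all(vapply(args, is.factor, logical(1))))
    stop("All arguments must be factors")

  # Union of all level sets, in order of first appearance
  newlevels <- unique(unlist(lapply(args, levels)))

  # For each factor, map its old codes to positions in newlevels
  codes <- unlist(lapply(args, function(f) {
    match(levels(f), newlevels)[unclass(f)]
  }))

  structure(codes, levels = newlevels, class = "factor")
}

x <- factor(c("a", "b", "c", "d", "e"))
y <- factor(c("d", "e", "f", "g", "h"))
z <- combine_factors(x, y)
```

Only the level sets (usually short) pass through match(); the data
themselves stay as integer codes throughout.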

Regards,
Matthew

> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk] 
> Sent: 14 November 2006 18:23
> To: Marc Schwartz
> Cc: Matthew Dowle; r-devel at r-project.org
> Subject: Re: [Rd] c.factor
> 
> 
> Well, R has managed without a factor method for c() for most of its
> decade of existence (not that it originally had factors as we know them).
> 
> I would argue that factors are best viewed as an enumeration type, and
> anything which silently changes their level set is a bad idea.  I can see
> a case for a c() method for factors that combines factors with the same
> level sets, but I can also see this is best done by users who know the
> level sets are same (c.factor would have to expend a considerable effort
> to check).
> 
> You also need to consider the dispatch rules.  c.factor will be called
> whenever the first argument is a factor, whatever the others are. S4 (I
> think, definitely S4-based versions of S-PLUS) has an alternative concat()
> that works differently (recursively) and seems a more natural model.
> 
> 
> On Tue, 14 Nov 2006, Marc Schwartz wrote:
> 
> > On Tue, 2006-11-14 at 11:51 -0600, Marc Schwartz wrote:
> >> On Tue, 2006-11-14 at 16:36 +0000, Matthew Dowle wrote:
> >>> Hi,
> >>>
> >>> Given factors x and y, c(x,y) does not seem to return a useful
> >>> result:
> >>>> x
> >>> [1] a b c d e
> >>> Levels: a b c d e
> >>>> y
> >>> [1] d e f g h
> >>> Levels: d e f g h
> >>>> c(x,y)
> >>>  [1] 1 2 3 4 5 1 2 3 4 5
> >>>>
> >>>
> >>> Is there a case for a new method c.factor as follows?  Does
> >>> something similar exist already?  Is there a better way to write the
> >>> function?
> >>>
> >>>> c.factor = function(x,y)
> >>> {
> >>>     newlevels = union(levels(x),levels(y))
> >>>     m = match(levels(y), newlevels)
> >>>     ans = c(unclass(x),m[unclass(y)])
> >>>     levels(ans) = newlevels
> >>>     class(ans) = "factor"
> >>>     ans
> >>> }
> >>>> c(x,y)
> >>>  [1] a b c d e d e f g h
> >>> Levels: a b c d e f g h
> >>>> as.integer(c(x,y))
> >>>  [1] 1 2 3 4 5 4 5 6 7 8
> >>>>
> >>>
> >>> Regards,
> >>> Matthew
> >>
> >> I'll defer to others as to whether or not there is a basis for 
> >> c.factor,
> >> however:
> >>
> >> c.factor <- function(...)
> >> {
> >>   args <- list(...)
> >>
> >>   # this could be optional
> >>   if (!all(sapply(args, is.factor)))
> >>    stop("All arguments must be factors")
> >>
> >>   factor(unlist(lapply(args, function(x) as.character(x))))
> >> }
> >
> >
> > That last line can even be cleaned up, as I was doing something else
> > initially:
> >
> > c.factor <- function(...)
> > {
> >  args <- list(...)
> >
> >  if (!all(sapply(args, is.factor)))
> >   stop("All arguments must be factors")
> >
> >  factor(unlist(lapply(args, as.character)))
> > }
> >
> >
> > Marc
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list 
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> 
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> 
>