[R] Basic question on concatenating factors

Prof Brian Ripley ripley at stats.ox.ac.uk
Sun Nov 23 08:36:10 CET 2008


On Sun, 23 Nov 2008, jim holtman wrote:

> You are right.  union uses 'unique(c(x,y))' and I am not sure if
> 'unique' preserves the order, but the help page seems to indicate that
> "an element is omitted if it is identical to any previous element";
> this might mean that the order is preserved.

It says

      'unique' returns a vector, data frame or array like 'x' but with
      duplicate elements/rows removed.

Although it is a generic function, it is hard to see how that can be 
interpreted to allow the order to be changed.
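
A small illustration of the point, as a sketch only: the vectors and
level names below are made up for the example, and it assumes plain
character vectors rather than any other method of the generic.

```r
# Made-up data: duplicates are dropped, first occurrences keep their place.
x <- c("b", "a", "c", "a", "b")
unique(x)                      # "b" "a" "c"

# Hypothetical level sets standing in for levels(x) and levels(y).
lx <- c("low", "mid")
ly <- c("high", "mid")
union(lx, ly)                  # "low" "mid" "high"

# With order preserved, union() agrees with the explicit construction
# proposed in the quoted message.
identical(union(lx, ly), c(lx, setdiff(ly, lx)))   # TRUE
```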

The claim that union would be more efficiently implemented via sorting is 
made with no evidence: do look up a basic computer science textbook for 
this kind of thing, as well as how R actually does it.  (Also 'efficient' 
was not defined: both speed and memory usage are potentially measures of 
efficiency.)  But for example

> x <- rnorm(1e7)
> system.time(unique(x))
    user  system elapsed
   2.258   0.261   2.523
> system.time(sort(x))
    user  system elapsed
   4.102   0.112   4.231
> system.time(sort(x, method="quick"))
    user  system elapsed
   1.928   0.109   2.047

will indicate that unique() is comparable in speed to sorting.
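
For anyone wanting to try the comparison on their own machine, here is a
rough sketch.  The sort-then-drop-repeats step is only a stand-in for
the kind of sort-based approach suggested in the quoted message, not how
R implements anything; the seed, sizes and variable names are arbitrary,
and the timings will vary by system.

```r
set.seed(1)                                   # arbitrary seed
x <- sample.int(1000L, 1e6, replace = TRUE)   # many duplicates

# unique(): result is in order of first occurrence.
u_first <- unique(x)

# Sort-based alternative: sort, then keep each value that differs
# from its predecessor.  The result is in sorted order instead.
s <- sort(x)
u_sorted <- s[c(TRUE, s[-1L] != s[-length(s)])]

setequal(u_first, u_sorted)    # same set of values either way

# Timings are machine dependent; compare them locally.
system.time(unique(x))
system.time({ s <- sort(x); s[c(TRUE, s[-1L] != s[-length(s)])] })
```

Both give the same set of values, but only unique() keeps the original
order, which is what the level-matching code in this thread relies on.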


>
> On Sat, Nov 22, 2008 at 11:43 PM, Stavros Macrakis
> <macrakis at alum.mit.edu> wrote:
>> On Sat, Nov 22, 2008 at 10:20 AM, jim holtman <jholtman at gmail.com> wrote:
>>>  c.Factor <-
>>> function (x, y)
>>> {
>>>    newlevels = union(levels(x), levels(y))
>>>    m = match(levels(y), newlevels)
>>>    ans = c(unclass(x), m[unclass(y)])
>>>    levels(ans) = newlevels
>>>    class(ans) = "factor"
>>>    ans
>>> }
>>
>> This algorithm depends crucially on union preserving the order of the
>> elements of its arguments. As far as I can tell, the spec of union
>> does not require this.  If union were to (for example) sort its
>> arguments then merge them (generally a more efficient algorithm), this
>> function would no longer work.
>>
>> Fortunately, the fix is simple.  Instead of union, use:
>>
>>     newlevels <- c(levels(x), setdiff(levels(y), levels(x)))
>>
>> which is guaranteed to preserve the order of levels(x).
>>
>>             -s
>>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


