[R] Basic question on concatenating factors

Sun Nov 23 15:40:33 CET 2008

You do have to read a little further on the help page to see that the
order is preserved, that is, that a duplicate is removed when it appears
after, and not before, its first occurrence in the vector:

"Note that unlike the Unix command uniq this omits duplicated and not
just repeated elements/rows. That is, an element is omitted if it is
identical to any previous element and not just if it is the same as
the immediately previous one."

This does make it clear that the original order is preserved since it
is succeeding elements that are removed.  So from this, I assume that
the use of

unique(c(x,y))

does preserve the original ordering of the elements.
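
A quick check at the prompt is consistent with that reading.  This is
only a sketch: the vectors x and y below are made up for illustration
and are not from the thread, and one example shows the behaviour on
that input rather than a guarantee from the documentation.

```r
# Hypothetical example vectors, chosen so that sorted order differs
# from order of first appearance.
x <- c("pear", "apple", "pear")
y <- c("zebra", "apple", "banana")

# unique() keeps the first occurrence of each value and drops later ones,
# so the result follows the order of first appearance in c(x, y).
u <- unique(c(x, y))
print(u)

# Note: unique(x, y) is a different call.  The second argument of
# unique() is 'incomparables', not more data to concatenate.
```

If the result had come back sorted, "apple" would have been first; here
"pear" stays first because it appears first in c(x, y).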

On Sun, Nov 23, 2008 at 2:36 AM, Prof Brian Ripley
<ripley at stats.ox.ac.uk> wrote:
> On Sun, 23 Nov 2008, jim holtman wrote:
>
>> You are right.  union uses 'unique(c(x,y))' and I am not sure if
>> 'unique' preserves the order, but the help page seems to indicate that
>> "an element is omitted if it is identical to any previous element ";
>> this might mean that the order is preserved.
>
> It says
>
>     'unique' returns a vector, data frame or array like 'x' but with
>     duplicate elements/rows removed.
>
> Although it is a generic function, it is hard to see how that can be
> interpreted to allow the order to be changed.
>
> The claim that union would be more efficiently implemented via sorting is
> made with no evidence: do look up a basic computer science textbook for this
> kind of thing, as well as how R actually does it.  (Also 'efficient' was not
> defined: both speed and memory usage are potentially measures of
> efficiency.)  But for example
>
> > x <- rnorm(1e7)
> > system.time(unique(x))
>    user  system elapsed
>   2.258   0.261   2.523
> > system.time(sort(x))
>    user  system elapsed
>   4.102   0.112   4.231
> > system.time(sort(x, method="quick"))
>    user  system elapsed
>   1.928   0.109   2.047
>
> will indicate that unique() is comparable in speed to sorting.
>
>
>>
>> On Sat, Nov 22, 2008 at 11:43 PM, Stavros Macrakis
>> <macrakis at alum.mit.edu> wrote:
>>>
>>> On Sat, Nov 22, 2008 at 10:20 AM, jim holtman <jholtman at gmail.com> wrote:
>>>>
>>>>  c.Factor <-
>>>> function (x, y)
>>>> {
>>>>   newlevels = union(levels(x), levels(y))
>>>>   m = match(levels(y), newlevels)
>>>>   ans = c(unclass(x), m[unclass(y)])
>>>>   levels(ans) = newlevels
>>>>   class(ans) = "factor"
>>>>   ans
>>>> }
>>>
>>> This algorithm depends crucially on union preserving the order of the
>>> elements of its arguments. As far as I can tell, the spec of union
>>> does not require this.  If union were to (for example) sort its
>>> arguments then merge them (generally a more efficient algorithm), this
>>> function would no longer work.
>>>
>>> Fortunately, the fix is simple.  Instead of union, use:
>>>
>>>    newlevels <- c(levels(x),setdiff(levels(y),levels(x)))
>>>
>>> which is guaranteed to preserve the order of levels(x).
>>>
>>>            -s
>>>
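
For reference, the suggested fix can be put together with the rest of
the function roughly as follows.  This is a sketch, not the code anyone
posted: the name concat_factors and the inputs f1 and f2 are
hypothetical, and the result is built with factor() instead of assigning
the class by hand.

```r
# Hypothetical helper name; same approach as c.Factor above, with the
# union() call replaced by the c()/setdiff() construction.
concat_factors <- function(x, y) {
  # Levels of x first, then any levels of y not already present.
  newlevels <- c(levels(x), setdiff(levels(y), levels(x)))
  # Position of each of y's levels within the combined levels.
  m <- match(levels(y), newlevels)
  # Integer codes of x are unchanged; codes of y are remapped through m.
  codes <- c(as.integer(x), m[as.integer(y)])
  factor(newlevels[codes], levels = newlevels)
}

# Made-up inputs for illustration.
f1 <- factor(c("b", "a", "b"), levels = c("b", "a"))
f2 <- factor(c("c", "a"), levels = c("c", "a"))
res <- concat_factors(f1, f2)
print(levels(res))
print(as.character(res))
```

Because levels(x) is passed to c() first, the codes of x stay valid
without any remapping; only the codes of y go through match().
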
>
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?