[R] how to efficiently compute set unique?

Tue Jun 22 04:01:17 CEST 2010

On Mon, Jun 21, 2010 at 8:38 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Jun 21, 2010, at 9:18 PM, Duncan Murdoch wrote:
>
>> On 21/06/2010 9:06 PM, G FANG wrote:
>>>
>>> Hi,
>>>
>>> I want to get the unique set from a large numeric k by 1 vector, k is
>>> in tens of millions
>>>
>>> when I used the matlab function unique, it takes less than 10 secs
>>>
>>> but when I tried to use the unique in R with similar CPU and memory,
>>> it is not done in minutes
>>>
>>> I am wondering, am I using the function in the right way?
>>>
>>> dim(cntxtn)
>>> [1] 13584763        1
>>> uniqueCntxt = unique(cntxtn);    # this is taking really long
>>
>> What type is cntxtn?  If I do that sort of thing on a numeric vector, it's
>> quite fast:
>>
>> > x <- sample(100000, size=13584763, replace=TRUE)
>> > system.time(unique(x))
>>  user  system elapsed
>>  3.61    0.14    3.75
>
> If it's a factor, it could be as simple as:
>
> levels(cntxtn)  # since the work of "unique-ification" has already been done.

Not quite.  When you generate a factor, as you do in your example, the
levels correspond to the unique values of the original vector.  But
when you take a subset of a factor the levels are preserved intact,
even if some of those levels do not occur in the subset.  This is why
there are unusual arguments with names like drop.unused.levels in
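functions like model.frame.  It also accounts for a subtle difference
between factor(x) and as.factor(x) when x is already a factor:
factor(x) rebuilds the levels from the values actually present, whereas
as.factor(x) returns x unchanged, unused levels included.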

> ff <- factor(sample.int(200, 1000, replace = TRUE))
> ff1 <- ff[1:40]
> length(levels(ff))
[1] 199
> length(levels(ff1))
[1] 199
> length(levels(as.factor(ff1)))
[1] 199
> length(levels(factor(ff1)))
[1] 34
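
If what you want is the set of values that actually occur in a factor,
rather than every level it carries, something along these lines should
work.  This is only a sketch: the seed and the object names all_levels,
used_levels and present are my own choices for illustration, and the
counts you get will depend on the random sample.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
ff  <- factor(sample.int(200, 1000, replace = TRUE))
ff1 <- ff[1:40]

# levels() reports every level stored on the factor, used or not
all_levels <- levels(ff1)

# factor() applied to a factor rebuilds the levels from the values present
used_levels <- levels(factor(ff1))

# The same set, obtained from the values themselves (as character strings)
present <- unique(as.character(ff1))

stopifnot(setequal(used_levels, present))
stopifnot(all(used_levels %in% all_levels))
```

Subsetting with ff[1:40, drop = TRUE] is another way to discard the
unused levels at the point where the subset is taken.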

>> x <- factor(sample(100000, size=13584763, replace=TRUE))
>> system.time(levels(x))
>   user  system elapsed
>      0       0       0
>> system.time(y <- levels(x))
>   user  system elapsed
>      0       0       0
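
Returning to the original question: the dim() output suggests that
cntxtn may be a one-column matrix rather than a plain vector, although
the thread does not confirm this.  If so, unique() dispatches to its
matrix method, which removes duplicated rows and can be much slower
than the vector method, particularly in older versions of R.  A minimal
sketch under that assumption, with a hypothetical stand-in m for cntxtn
and a much smaller k:

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in for cntxtn: a numeric k-by-1 matrix
m <- matrix(as.numeric(sample(1000, size = 1e5, replace = TRUE)), ncol = 1)

# On a matrix, unique() drops duplicated rows and returns a matrix
u_rows <- unique(m)

# Dropping the dim attribute first lets the vector method do the work
u_vec <- unique(as.vector(m))

# Both approaches give the same values in the same order
stopifnot(is.matrix(u_rows), !is.matrix(u_vec))
stopifnot(identical(as.vector(u_rows), u_vec))
```

Wrapping each call in system.time() on the real data would show
whether the matrix method is in fact the bottleneck.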