[R] which rows are duplicates?

Tue Mar 31 13:38:10 CEST 2009

Wacek Kusnierczyk wrote:
> Michael Dewey wrote:
>   
>> At 05:07 30/03/2009, Aaron M. Swoboda wrote:
>>     
>>> I would like to know which rows are duplicates of each other, not
>>> simply that a row is duplicate of another row. In the following
>>> example rows 1 and 3 are duplicates.
>>>
>>>       
>>>> x <- c(1,3,1)
>>>> y <- c(2,4,2)
>>>> z <- c(3,4,3)
>>>> data <- data.frame(x,y,z)
>>>>         
>>>     x y z
>>> 1 1 2 3
>>> 2 3 4 4
>>> 3 1 2 3
>>>       
>
> i don't have any solution significantly better than what you have
> already been given.  

i now seem to have one:

    # dummy data
    data = data.frame(
        x=sample(1:2, 5, replace=TRUE),
        y=sample(1:2, 5, replace=TRUE))

    # add a class column; identical rows have the same class id
    data$class = local({
        rows = do.call('paste', c(data, sep='\r'))
        with(
            rle(sort(rows)),
            rep(1:length(values), lengths)[rank(rows)] ) })

    data
    #   x y class
    # 1 2 2     3
    # 2 2 1     2
    # 3 2 1     2
    # 4 1 2     1
    # 5 2 2     3
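
to make the steps visible, here is a sketch of the same idea applied to the three-row example from the start of the thread, one intermediate at a time.  the names runs and ids are introduced here for illustration only, and ties.method='min' is an addition: the version above relies on the default averaged ranks being truncated when used as an index, which lands on the same class.

```r
# the three-row example from the start of the thread
data = data.frame(x=c(1,3,1), y=c(2,4,2), z=c(3,4,3))

# one string key per row; '\r' is just an unlikely separator
rows = do.call('paste', c(data, sep='\r'))

# runs of identical keys in sorted order
runs = rle(sort(rows))

# class id for each position in the sorted vector
ids = rep(seq_along(runs$values), runs$lengths)

# rank() maps each original row to a position in the sorted vector;
# ties.method='min' (an addition here) keeps that index a whole number
data$class = ids[rank(rows, ties.method='min')]

data
#   x y z class
# 1 1 2 3     1
# 2 3 4 4     2
# 3 1 2 3     1
```

rows 1 and 3 share class 1, which is the "which rows are duplicates of each other" answer the original question asked for.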

this approach seems to compare favourably with michael's in speed, by a
margin that depends on the shape (and size?) of the input:

    # dummy data frame, just integers
    n = 100; m = 100
    data = as.data.frame(
        matrix(nrow=n, ncol=m,
            sample(n, m*n, replace=TRUE)))

    # do a simple benchmarking
    library(rbenchmark)
    benchmark(replications=100, order='elapsed',
        columns=c('test', 'elapsed'),
        waku=local({
            rows = do.call('paste', c(data, sep='\r'))
            data$class = with(
                rle(sort(rows)),
                rep(1:length(values), lengths)[rank(rows)] ) }),
        mide=local({
            unique = unique(data)
            data = merge(data, cbind(unique, class=1:nrow(unique))) }))

    #   test elapsed
    # 1 waku   0.503
    # 2 mide   3.269

and for m = 10 and n = 1000 i get:

    #   test elapsed
    # 1 waku   0.571
    # 2 mide  15.836

while for m = 1000 and n = 10 i get:

    #   test elapsed
    # 1 waku   1.110
    # 2 mide   2.461

the type of the content should not have any impact on the ratio (pure
guess, no testing done). 

whether my approach is more intuitive is arguable.  note that, unlike in
michael's solution, the final result (the data frame with a class column
added) is in the original order.  (and sorting would add a performance
penalty in the other case.)
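
for completeness, a minimal sketch of what that extra sorting step could look like for the merge-based approach.  the helper column name row and the variable names uniq and merged are hypothetical, chosen here for illustration; this is one possible way to do it, not necessarily what michael had in mind.

```r
data = data.frame(x=c(1,3,1), y=c(2,4,2), z=c(3,4,3))

# class ids from the unique rows, as in the merge-based solution
uniq = unique(data)
uniq$class = 1:nrow(uniq)

# remember the original row positions before merging;
# 'row' is a hypothetical name and must not clash with a real column
data$row = 1:nrow(data)

merged = merge(data, uniq)              # joins on the shared columns x, y, z
merged = merged[order(merged$row), ]    # the extra sorting step
merged$row = NULL
rownames(merged) = NULL

merged
#   x y z class
# 1 1 2 3     1
# 2 3 4 4     2
# 3 1 2 3     1
```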

my previous remarks about the treatment of NAs still apply;  the
do.call('paste', ...) idiom is taken from duplicated.data.frame.
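
the earlier remarks are not repeated in this message, so the following is only one NA-related consequence of building row keys with paste, and possibly not the only one meant: paste renders a missing value as the text 'NA', so a missing entry and the literal string 'NA' end up with the same key.  the data below are made up for illustration.

```r
# a character column where one entry is missing and another is the string 'NA'
data = data.frame(x=c(NA, 'NA', 'a'), y=c(1, 1, 1), stringsAsFactors=FALSE)

rows = do.call('paste', c(data, sep='\r'))

# paste() renders NA as the text 'NA', so rows 1 and 2 get the same key
rows[1] == rows[2]
# [1] TRUE
```

with such input, rows 1 and 2 would be put in the same class even though one value is missing and the other is not.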

regards,
vQ

>> Does this do what you want?
>>     
>>> x <- c(1,3,1)
>>> y <- c(2,4,2)
>>> z <- c(3,4,3)
>>> data <- data.frame(x,y,z)
>>> data.u <- unique(data)
>>> data.u
>>>       
>>   x y z
>> 1 1 2 3
>> 2 3 4 4
>>     
>>> data.u <- cbind(data.u, set = 1:nrow(data.u))
>>> merge(data, data.u)
>>>       
>>   x y z set
>> 1 1 2 3   1
>> 2 1 2 3   1
>> 3 3 4 4   2
>>
>> You need to do a bit more work to get them back into the original row
>> order if that is essential.
>>