[R] which rows are duplicates?

Tue Mar 31 14:29:18 CEST 2009

Wacek Kusnierczyk wrote:
> Wacek Kusnierczyk wrote:
>> Michael Dewey wrote:
>>   
>>> At 05:07 30/03/2009, Aaron M. Swoboda wrote:
>>>     
>>>> I would like to know which rows are duplicates of each other, not
>>>> simply that a row is duplicate of another row. In the following
>>>> example rows 1 and 3 are duplicates.
>>>>
>>>>       
>>>>> x <- c(1,3,1)
>>>>> y <- c(2,4,2)
>>>>> z <- c(3,4,3)
>>>>> data <- data.frame(x,y,z)
>>>>>         
>>>>     x y z
>>>> 1 1 2 3
>>>> 2 3 4 4
>>>> 3 1 2 3
>>>>       
>> i don't have any solution significantly better than what you have
>> already been given.  
> 
> i now seem to have one:
> 
>     # dummy data
>     data = data.frame(
>         x=sample(1:2, 5, replace=TRUE),
>         y=sample(1:2, 5, replace=TRUE))
>    
>     # add a class column; identical rows have the same class id
>     data$class = local({
>         rows = do.call('paste', c(data, sep='\r'))
>         with(
>             rle(sort(rows)),
>             rep(1:length(values), lengths)[rank(rows)] ) })
> 
>     data
>     #   x y class
>     # 1 2 2     3
>     # 2 2 1     2
>     # 3 2 1     2
>     # 4 1 2     1
>     # 5 2 2     3
> 

another approach (maybe a bit cleaner) seems to be:

data <- data.frame(x = sample(1:2, 5, replace = TRUE),
                   y = sample(1:2, 5, replace = TRUE))

vals <- do.call('paste', c(data, sep = '\r'))
data$class <- match(vals, unique(vals))
data

I haven't tried benchmarking it.
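
To get from the class ids back to the original question (which rows are duplicates of each other), the row indices can be grouped by class. The following is only a minimal sketch on the three-row example from the original post; the names groups and dup_groups are made up for illustration.

```r
# Example from the original post: rows 1 and 3 are identical
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# One key string per row; identical rows get identical keys
vals <- do.call('paste', c(data, sep = '\r'))
data$class <- match(vals, unique(vals))

# Row indices grouped by class id
groups <- split(seq_len(nrow(data)), data$class)

# Keep only the classes that contain more than one row
dup_groups <- groups[lengths(groups) > 1]
dup_groups
# $`1`
# [1] 1 3
```

Since match() returns the position of the first occurrence, the class ids are numbered in order of first appearance rather than in sorted order.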

Best,
Dimitris

> this approach seems to be considerably faster than michael's, by a
> factor that depends on the shape (and size?) of the input:
> 
>     # dummy data frame, just integers
>     n = 100; m = 100
>     data = as.data.frame(
>         matrix(nrow=n, ncol=m,
>             sample(n, m*n, replace=TRUE)))
> 
>     # do a simple benchmarking
>     library(rbenchmark)
>     benchmark(replications=100, order='elapsed',
>         columns=c('test', 'elapsed'),
>         waku=local({
>             rows = do.call('paste', c(data, sep='\r'))
>             data$class = with(
>                 rle(sort(rows)),
>                 rep(1:length(values), lengths)[rank(rows)] ) }),
>         mide=local({
>             unique = unique(data)
>             data = merge(data, cbind(unique, class=1:nrow(unique))) }))
> 
>     #   test elapsed
>     # 1 waku   0.503
>     # 2 mide   3.269
> 
> and for m = 10 and n = 1000 i get:
> 
>     #   test elapsed
>     # 1 waku   0.571
>     # 2 mide  15.836
> 
> while for m = 1000 and n = 10 i get:
> 
>     #   test elapsed
>     # 1 waku   1.110
>     # 2 mide   2.461
> 
> the type of the content should not have any impact on the ratio (pure
> guess, no testing done). 
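
The guess about content type could be checked along the following lines. This is a rough sketch, not part of the benchmark above: it uses base system.time instead of rbenchmark, the helpers class_by_rle and class_by_merge are hypothetical wrappers around the two approaches, and the sizes and the seed are arbitrary. No timings are claimed here.

```r
set.seed(1)  # arbitrary seed, only so that a run is repeatable
n <- 100; m <- 10

# The same random values stored twice: as integers and as strings
ints  <- as.data.frame(matrix(sample(n, m * n, replace = TRUE),
                              nrow = n, ncol = m))
chars <- as.data.frame(lapply(ints, as.character),
                       stringsAsFactors = FALSE)

# Hypothetical wrapper around the rle/rank approach
class_by_rle <- function(d) {
    rows <- do.call('paste', c(d, sep = '\r'))
    with(rle(sort(rows)),
         rep(seq_along(values), lengths)[rank(rows)])
}

# Hypothetical wrapper around the unique/merge approach
class_by_merge <- function(d) {
    u <- unique(d)
    merge(d, cbind(u, class = seq_len(nrow(u))))
}

# Time ten repetitions of each approach on each kind of content
for (d in list(ints = ints, chars = chars)) {
    print(system.time(for (i in 1:10) class_by_rle(d)))
    print(system.time(for (i in 1:10) class_by_merge(d)))
}
```

Comparing the ratio of the two elapsed times for ints with the ratio for chars would show whether the content type matters.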
> 
> whether my approach is more intuitive is arguable.  note that, unlike in
> michael's solution, the final result (the data frame with a class column
> added) is in the original order.  (and sorting would add a performance
> penalty in the other case.)
> 
> my previous remarks about the treatment of NAs still apply;  the
> do.call('paste', ...) idiom is taken from duplicated.data.frame.
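
The earlier remarks about NAs are not quoted in this message. One caveat that follows directly from how paste works is that a missing value is converted to the string "NA", so it becomes indistinguishable from a literal 'NA'. A small sketch with made-up data:

```r
# A missing value and the literal string 'NA' in the same column
d <- data.frame(a = c(NA, 'NA', 'b'), b = c(1, 1, 1),
                stringsAsFactors = FALSE)

# paste() turns NA into the string "NA"
keys <- do.call('paste', c(d, sep = '\r'))
keys

# Rows 1 and 2 therefore end up in the same class
match(keys, unique(keys))
# [1] 1 1 2
```

If the two cases need to be kept apart, the NAs would have to be recoded to some value that cannot occur in the data before the keys are built.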
> 
> regards,
> vQ
> 
> 
> 
>>> Does this do what you want?
>>>     
>>>> x <- c(1,3,1)
>>>> y <- c(2,4,2)
>>>> z <- c(3,4,3)
>>>> data <- data.frame(x,y,z)
>>>> data.u <- unique(data)
>>>> data.u
>>>>       
>>>   x y z
>>> 1 1 2 3
>>> 2 3 4 4
>>>     
>>>> data.u <- cbind(data.u, set = 1:nrow(data.u))
>>>> merge(data, data.u)
>>>>       
>>>   x y z set
>>> 1 1 2 3   1
>>> 2 1 2 3   1
>>> 3 3 4 4   2
>>>
>>> You need to do a bit more work to get them back into the original row
>>> order if that is essential.
>>>
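
The extra work for restoring the original row order could look roughly like this. It is a sketch under one assumption: a helper column (here called row, a hypothetical name) is added to record each row's position before the merge and is dropped again afterwards.

```r
x <- c(1, 3, 1)
y <- c(2, 4, 2)
z <- c(3, 4, 3)
data <- data.frame(x, y, z)

data.u <- unique(data)
data.u <- cbind(data.u, set = 1:nrow(data.u))

# Hypothetical helper column: the original position of each row
data$row <- seq_len(nrow(data))

# merge() joins on the shared columns x, y, z and sorts by them
merged <- merge(data, data.u)

# Put the rows back in their original order and drop the helper
merged <- merged[order(merged$row), ]
merged$row <- NULL
rownames(merged) <- NULL
merged
#   x y z set
# 1 1 2 3   1
# 2 3 4 4   2
# 3 1 2 3   1
```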
> 

-- 
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014