[R] which rows are duplicates?

Tue Mar 31 14:41:52 CEST 2009

Dimitris Rizopoulos wrote:
>
>>
>> another approach (maybe a bit cleaner) seems to be:
>>
>> data <- data.frame(x=sample(1:2, 5, replace=TRUE), y=sample(1:2, 5,
>> replace = TRUE))
>>
>> vals <- do.call('paste', c(data, sep = '\r'))
>> data$class <- match(vals, unique(vals))
>> data
>>
>>
>> I have tried benchmarking it.
>
> sorry, I wanted to write: I have *not* tried benchmarking it.

    # dummy data frame, just integers
    n = 100; m = 100
    data = as.data.frame(
        matrix(nrow=n, ncol=m,
            sample(n, m*n, replace=TRUE)))

    # do a simple benchmarking
    library(rbenchmark)
    benchmark(
        replications=100,
        order='elapsed',
        columns=c('test', 'elapsed'),
        # waku: sort the row keys, number the runs, map the ids back via rank
        waku=local({
            rows = do.call('paste', c(data, sep='\r'))
            data$class = with(
                rle(sort(rows)),
                rep(1:length(values), lengths)[rank(rows)] ) }),
        # diri: position of each row key among the unique keys
        diri=local({
            values = do.call('paste', c(data, sep='\r'))
            data$class = match(values, unique(values)) }) )

        #  test elapsed
        # 2 diri    0.43
        # 1 waku    0.52
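
a quick sanity check that is not part of the timings above: the two
variants number the classes differently, so comparing the id vectors
directly would not tell much.  a minimal sketch, assuming a small random
data frame `d`; the names `waku_ids`, `diri_ids` and `same_partition` are
made up for illustration:

```r
# small data frame with plenty of duplicated rows (illustrative data only)
set.seed(1)
d <- data.frame(x = sample(1:2, 20, replace = TRUE),
                y = sample(1:2, 20, replace = TRUE))

# one string key per row
rows <- do.call('paste', c(d, sep = '\r'))

# waku variant: ids follow the sort order of the keys
waku_ids <- with(rle(sort(rows)),
                 rep(seq_along(values), lengths)[rank(rows)])

# diri variant: ids follow the order of first appearance
diri_ids <- match(rows, unique(rows))

# hypothetical helper: two labelings describe the same grouping
# if "row i and row j share an id" agrees for every pair of rows
same_partition <- function(a, b) {
    all(outer(a, a, `==`) == outer(b, b, `==`))
}

same_partition(waku_ids, diri_ids)
```

the pairwise comparison is quadratic in the number of rows, so it is only
meant for small test cases like this one.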

the timings are comparable for m=n=100 (and diri's version does even
better for n >> m), but its code is way cleaner, and the class ids are
now better sorted (numbered in order of first appearance).  that's
collaborative problem solving ;)
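
to see how the class ids come out with the match/unique variant, here is
a small sketch on a fixed data frame.  the data and the extra `dup`
column are assumptions for illustration; `duplicated()` is base R and is
only added to tie the ids back to the subject line:

```r
# tiny fixed example: rows 1 and 3 are identical, and so are rows 2 and 5
d <- data.frame(x = c(2, 1, 2, 1, 1),
                y = c(1, 2, 1, 1, 2))

# one string key per row
vals <- do.call('paste', c(d, sep = '\r'))

# class id = position of the row's key among the distinct keys
d$class <- match(vals, unique(vals))
d$class
# 1 2 1 3 2

# TRUE for rows that repeat an earlier row
d$dup <- duplicated(vals)
d$dup
# FALSE FALSE TRUE FALSE TRUE
```

rows that share a class id are duplicates of each other, and the ids
increase in the order in which each distinct row first shows up.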

best,
vQ