[R] compare two data frames of different dimensions and only keep unique rows

Rui Barradas rui1174 at sapo.pt
Tue Feb 28 02:05:41 CET 2012


Hello,

I've made Petr's solution a bit more general (my code follows the quoted message).


Petr Savicky wrote
> 
> On Mon, Feb 27, 2012 at 07:10:57PM +0100, Arnaud Gaboury wrote:
>> No, but I tried your way too.
>> 
>> In fact, the only three unique rows are these ones:
>> 
>>  Product Price Nbr.Lots
>>    Cocoa  2440        5
>>    Cocoa  2450        1
>>    Cocoa  2440        6
>> 
>> Here is a dirty working trick I found :
>> 
>> > df<-merge(exportfile,reported,all.y=T)
>> > df1<-merge(exportfile,reported)
>> > dff1<-do.call(paste,df)
>> > dff<-do.call(paste,df)
>> > dff1<-do.call(paste,df1)
>> > df[!dff %in% dff1,]
>>   Product Price Nbr.Lots
>> 3   Cocoa  2440        5
>> 4   Cocoa  2450        1
>>  
>> 
>> My two problems are: I don't think this is very clean code, and I won't
>> know in advance which of my two data frames will have the greater dimension
>> (I can add some lines to deal with it, but again, that seems very heavy).
> 
> Hi.
> 
> Try the following.
> 
>   setdiffDF <- function(A, B)
>   {
>       A[!duplicated(rbind(B, A))[nrow(B) + 1:nrow(A)], ]
>   }
> 
>   df1 <- setdiffDF(reported, exportfile)
>   df2 <- setdiffDF(exportfile, reported)
>   rbind(df1, df2)
> 
> I obtained
> 
>      Product Price Nbr.Lots
>   3    Cocoa  2440        5
>   4    Cocoa  2450        1
>   31   Cocoa  2440        6
> 
> Is this correct? I see the row
> 
>   Cocoa  2440.00        6
> 
> only in exportfile and not in reported.
> 
> The trick with paste() is not a bad idea. A variant of
> it is used also in the base function duplicated.matrix(),
> since it contains
> 
>   apply(x, MARGIN, function(x) paste(x, collapse = "\r"))
> 
> If speed is critical, then possibly the paste() trick
> written for the whole columns, for example
> 
>   paste(df[[1]], df[[2]], df[[3]], sep="\r")
> 
> and then setdiff() can be better.
> 
> Hope this helps.
> 
> Petr Savicky.
> 

The code below computes the set difference (%-%) and the symmetric difference
(symdiff) for vectors, matrices, data.frames and (only lightly tested) lists.

#-----------------------------
# First the set difference

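`%-%` <- function(x, y) UseMethod("%-%")

# Vectors: elements of x that do not occur in y
# (duplicates within x are dropped as well)
`%-%.default` <- function(x, y){
    # put y in front of x; the elements of x that duplicated() does not flag are new
    # seq_along() instead of 1:length(x) so that an empty x works
    ix <- !duplicated(c(y, x))[length(y) + seq_along(x)]
    x[ix]
}
# Matrices and data frames: the same idea, row by row
`%-%.matrix` <- `%-%.data.frame` <- function(x, y){
    ix <- !duplicated(rbind(y, x))[nrow(y) + seq_len(nrow(x))]
    # drop = FALSE keeps a single remaining row (or column) two-dimensional
    x[ix, , drop = FALSE]
}
# Lists: every element of x against every element of y,
# NULL where the classes differ
`%-%.list` <- function(x, y){
    # identical() because class() can return more than one string
    f <- function(A, B)
        if(identical(class(A), class(B))) A %-% B
    lapply(y, function(Y) lapply(x, f, Y))
}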

# Then the set symmetric difference
symdiff <- function(x, y)  UseMethod("symdiff")
symdiff.default <- function(x, y)
    c(x %-% y, y %-% x)
# rbind() of two matrices is a matrix and of two data frames a data frame,
# so the class does not have to be restored by hand
symdiff.matrix <- symdiff.data.frame <- function(x, y)
    rbind(x %-% y, y %-% x)
symdiff.list <- function(x, y){
    # identical() because class() can return more than one string
    f <- function(A, B)
        if(identical(class(A), class(B))) symdiff(A, B)
    lapply(y, function(Y) lapply(x, f, Y))
}

# Test it with data.frames first (the OP data)

reported %-% exportfile
exportfile %-% reported

symdiff(reported, exportfile)
symdiff(exportfile, reported)
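
The four calls above need the OP's reported and exportfile, which are not part of this message. Here is a minimal stand-in so the idea can be tried. The rows are made up by me (an assumption, not the OP's data) and chosen only so that the three unique rows quoted above come out. The block uses Petr's setdiffDF(), lightly adapted with seq_len() and drop = FALSE, so that it runs on its own.

```r
# Hypothetical stand-ins for the OP's data frames (made-up rows)
reported <- data.frame(
  Product  = c("Cocoa", "Cocoa", "Cocoa", "Cocoa"),
  Price    = c(2400, 2410, 2440, 2450),
  Nbr.Lots = c(2, 3, 5, 1),
  stringsAsFactors = FALSE
)
exportfile <- data.frame(
  Product  = c("Cocoa", "Cocoa", "Cocoa"),
  Price    = c(2400, 2410, 2440),
  Nbr.Lots = c(2, 3, 6),
  stringsAsFactors = FALSE
)

# Rows of A that do not occur in B: stack B on top of A and keep
# the rows of A that duplicated() does not flag
setdiffDF <- function(A, B) {
  keep <- !duplicated(rbind(B, A))[nrow(B) + seq_len(nrow(A))]
  A[keep, , drop = FALSE]
}

only_reported <- setdiffDF(reported, exportfile)  # rows found only in reported
only_export   <- setdiffDF(exportfile, reported)  # rows found only in exportfile
res <- rbind(only_reported, only_export)          # symmetric difference
print(res)
```

Because both directions are computed, it does not matter which of the two data frames has more rows.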

#-----------------------------
# And some other data types

x <- 1:5
y <- 3:8
x %-% y
y %-% x
symdiff(x, y)
symdiff(y, x)

X <- list(a=x, rp=reported)
Y <- list(b=y, ef=exportfile)
X %-% Y
Y %-% X
symdiff(X, Y)
symdiff(Y, X)
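
For completeness, here is a rough sketch of the paste() variant Petr mentions in the quoted message: build one string key per row from the whole columns and compare the keys. The helper names row_key() and anti_rows() and the toy data are mine, for illustration only. I use %in% rather than setdiff() so that the matching rows can be indexed directly. Note that paste() converts everything to character, so this is only safe when values that should be equal also print identically, and unlike the duplicated() version it does not drop duplicates within A.

```r
# One string key per row; "\r" as separator because it is unlikely
# to occur inside a field (unname() avoids clashes with paste() arguments)
row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))

# Rows of A whose key does not occur among the keys of B
anti_rows <- function(A, B) {
  A[!(row_key(A) %in% row_key(B)), , drop = FALSE]
}

# Toy data, made up for this example
A <- data.frame(id = c(1, 2, 3), tag = c("a", "b", "c"), stringsAsFactors = FALSE)
B <- data.frame(id = c(2, 3, 4), tag = c("b", "x", "d"), stringsAsFactors = FALSE)

only_A <- anti_rows(A, B)     # rows of A not in B
only_B <- anti_rows(B, A)     # rows of B not in A
sym <- rbind(only_A, only_B)  # symmetric difference
print(sym)
```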

P.S. This question seems to pop up repeatedly.

Rui Barradas

