[R] compare two data frames of different dimensions and only keep unique rows

Arnaud Gaboury arnaud.gaboury at a2ct2.com
Tue Feb 28 14:11:10 CET 2012


Thank you very much for your setdiffDF(). It does the job perfectly.

Arnaud Gaboury
 
A2CT2 Ltd.


-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Petr Savicky
Sent: Monday, 27 February 2012 20:41
To: r-help at r-project.org
Subject: Re: [R] compare two data frames of different dimensions and only keep unique rows

On Mon, Feb 27, 2012 at 07:10:57PM +0100, Arnaud Gaboury wrote:
> No, but I tried your way too.
> 
> In fact, the only three unique rows are these ones:
> 
>  Product Price Nbr.Lots
>    Cocoa  2440        5
>    Cocoa  2450        1
>    Cocoa  2440        6
> 
> Here is a dirty but working trick I found:
> 
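> > df <- merge(exportfile, reported, all.y = TRUE)  # also keep unmatched rows of reported
> > df1 <- merge(exportfile, reported)               # rows common to both
> > dff <- do.call(paste, df)                        # one key string per row
> > dff1 <- do.call(paste, df1)
> > df[!dff %in% dff1, ]                             # rows found only in reported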
>   Product Price Nbr.Lots
> 3   Cocoa  2440        5
> 4   Cocoa  2450        1
>  
> 
> I have two problems with it: I don't think it is very clean code, and I won't know in advance which of my two data frames will have the greater dimension (I can add some lines to deal with that, but again, it seems very heavy).

Hi.

Try the following.

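  setdiffDF <- function(A, B)
  {
      # Stack B on top of A; a row of A that already occurs in B (or earlier
      # in A) is flagged by duplicated(). Keep the rows of A that are not.
      # seq_len() avoids the 1:0 trap when A has no rows.
      A[!duplicated(rbind(B, A))[nrow(B) + seq_len(nrow(A))], , drop = FALSE]
  }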

  df1 <- setdiffDF(reported, exportfile)
  df2 <- setdiffDF(exportfile, reported)
  rbind(df1, df2)

I obtained

     Product Price Nbr.Lots
  3    Cocoa  2440        5
  4    Cocoa  2450        1
  31   Cocoa  2440        6

Is this correct? I see the row

  Cocoa  2440.00        6

only in exportfile and not in reported.
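
For anyone who wants to check the behaviour without the original files, here is a minimal sketch. The two data frames below are made up for illustration (the real exportfile and reported were not posted), and the function uses seq_len() rather than 1:nrow(A), which is only a defensive variation for the case where A has no rows.

```r
# Hypothetical toy data; only the three "unique" rows mirror the thread.
exportfile <- data.frame(
  Product  = c("Cocoa", "Sugar", "Cocoa"),
  Price    = c(2400, 650, 2440),
  Nbr.Lots = c(3, 2, 6),
  stringsAsFactors = FALSE
)
reported <- data.frame(
  Product  = c("Cocoa", "Sugar", "Cocoa", "Cocoa"),
  Price    = c(2400, 650, 2440, 2450),
  Nbr.Lots = c(3, 2, 5, 1),
  stringsAsFactors = FALSE
)

# Rows of A that do not occur in B (both must have the same columns).
setdiffDF <- function(A, B) {
  # Stack B above A, then look only at the duplicated() flags of the A part.
  A[!duplicated(rbind(B, A))[nrow(B) + seq_len(nrow(A))], , drop = FALSE]
}

df1 <- setdiffDF(reported, exportfile)   # rows only in reported
df2 <- setdiffDF(exportfile, reported)   # rows only in exportfile
res <- rbind(df1, df2)                   # symmetric difference of the rows
print(res)
```

Because the function is called once in each direction, it does not matter which of the two data frames is larger.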

The trick with paste() is not a bad idea. A variant of it is also used in the base function duplicated.matrix(), which contains

  apply(x, MARGIN, function(x) paste(x, collapse = "\r"))

If speed is critical, then applying the paste() trick to whole columns, for example

  paste(df[[1]], df[[2]], df[[3]], sep="\r")

and then using setdiff() on the resulting keys may be faster.
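
A rough sketch of that whole-column variant follows. The helper names rowKey() and setdiffKey() and the toy data are assumptions made for this illustration, not something from the original files. Note that paste() compares numeric columns through their character representation, so this is only suitable when equal values print identically.

```r
# Hypothetical toy data, same shape as in the thread.
exportfile <- data.frame(
  Product  = c("Cocoa", "Sugar", "Cocoa"),
  Price    = c(2400, 650, 2440),
  Nbr.Lots = c(3, 2, 6),
  stringsAsFactors = FALSE
)
reported <- data.frame(
  Product  = c("Cocoa", "Sugar", "Cocoa", "Cocoa"),
  Price    = c(2400, 650, 2440, 2450),
  Nbr.Lots = c(3, 2, 5, 1),
  stringsAsFactors = FALSE
)

# One string key per row, built from whole columns in a single paste() call.
# unname() keeps column names from being mistaken for paste() arguments.
rowKey <- function(df) {
  do.call(paste, c(unname(as.list(df)), sep = "\r"))
}

# Rows of A whose key does not occur among the keys of B.
setdiffKey <- function(A, B) {
  keyA <- rowKey(A)
  onlyInA <- setdiff(keyA, rowKey(B))    # unique keys, in order of appearance
  A[match(onlyInA, keyA), , drop = FALSE]
}

onlyReported <- setdiffKey(reported, exportfile)
onlyExport   <- setdiffKey(exportfile, reported)
print(rbind(onlyReported, onlyExport))
```

Whether this is actually faster than the duplicated() version depends on the data, so it is worth timing both with system.time() on the real files.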

Hope this helps.

Petr Savicky.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

