[R] compare two data frames of different dimensions and only keep unique rows

Petr Savicky savicky at cs.cas.cz
Mon Feb 27 20:40:49 CET 2012


On Mon, Feb 27, 2012 at 07:10:57PM +0100, Arnaud Gaboury wrote:
> No, but I tried your way too.
> 
> In fact, the only three unique rows are these ones:
> 
>  Product Price Nbr.Lots
>    Cocoa  2440        5
>    Cocoa  2450        1
>    Cocoa  2440        6
> 
> Here is a dirty working trick I found :
> 
> > df<-merge(exportfile,reported,all.y=T)
> > df1<-merge(exportfile,reported)
> > dff1<-do.call(paste,df)
> > dff<-do.call(paste,df)
> > dff1<-do.call(paste,df1)
> > df[!dff %in% dff1,]
>   Product Price Nbr.Lots
> 3   Cocoa  2440        5
> 4   Cocoa  2450        1
>  
> 
> My two problems are: I don't think this is clean code, and I won't know in advance which of my two data frames will have the greater dimension (I can add some lines to deal with it, but again, that seems very heavy).

Hi.

Try the following.

  setdiffDF <- function(A, B)
  {
      # rows of A that do not occur in B; seq_len() also handles an empty A
      A[!duplicated(rbind(B, A))[nrow(B) + seq_len(nrow(A))], ]
  }

  df1 <- setdiffDF(reported, exportfile)
  df2 <- setdiffDF(exportfile, reported)
  rbind(df1, df2)

I obtained

     Product Price Nbr.Lots
  3    Cocoa  2440        5
  4    Cocoa  2450        1
  31   Cocoa  2440        6

Is this correct? I see the row

  Cocoa  2440.00        6

only in exportfile and not in reported.
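
For anyone who wants to run this without the original data, here is a minimal self-contained sketch. The two data frames are invented for illustration (the real exportfile and reported were not posted in this message), and the function is repeated with seq_len() and drop = FALSE so that the block runs on its own. It assumes both data frames have the same columns.

```r
# Hypothetical stand-ins for the poster's data; the values are invented.
exportfile <- data.frame(
  Product  = c("Cocoa", "Cocoa", "Cocoa"),
  Price    = c(2400, 2440, 2450),
  Nbr.Lots = c(2, 6, 3),
  stringsAsFactors = FALSE
)
reported <- data.frame(
  Product  = c("Cocoa", "Cocoa", "Cocoa", "Cocoa"),
  Price    = c(2400, 2450, 2440, 2450),
  Nbr.Lots = c(2, 3, 5, 1),
  stringsAsFactors = FALSE
)

# Rows of A that do not occur in B (same columns assumed in both).
setdiffDF <- function(A, B) {
  dup <- duplicated(rbind(B, A))
  A[!dup[nrow(B) + seq_len(nrow(A))], , drop = FALSE]
}

# Symmetric difference: rows found in exactly one of the two data frames.
only_reported   <- setdiffDF(reported, exportfile)
only_exportfile <- setdiffDF(exportfile, reported)
result <- rbind(only_reported, only_exportfile)
print(result)
```

Calling the function in both directions means it does not matter which data frame has more rows.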

The trick with paste() is not a bad idea. A variant of
it is also used in the base function duplicated.matrix(),
which contains

  apply(x, MARGIN, function(x) paste(x, collapse = "\r"))
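
To illustrate the idea on a small case (this is only a sketch of the technique with a made-up matrix, not the actual source of duplicated.matrix()):

```r
# Small invented matrix in which rows 1 and 3 are identical.
m <- matrix(c(1, 2,
              3, 4,
              1, 2), ncol = 2, byrow = TRUE)

# Collapse each row into one string, then look for repeated strings.
keys <- apply(m, 1, function(x) paste(x, collapse = "\r"))
dup  <- duplicated(keys)
print(dup)
```

Each row becomes a single character key, so finding duplicated rows reduces to finding duplicated strings.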

If speed is critical, then applying paste() to whole
columns at once, for example

  paste(df[[1]], df[[2]], df[[3]], sep="\r")

and then using setdiff() on the resulting strings may be faster.
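
A rough sketch of that column-wise variant follows. The helper names rowKey and symdiffDF are mine, not base R functions, and the example data are invented. The sketch assumes both data frames have the same columns in the same order and that no value contains "\r". Note also that numeric columns are compared through their character representation.

```r
# Build one string key per row by pasting whole columns (vectorised).
rowKey <- function(df) {
  do.call(paste, c(unname(as.list(df)), sep = "\r"))
}

# Rows occurring in exactly one of A and B, compared via their keys.
symdiffDF <- function(A, B) {
  keyA <- rowKey(A)
  keyB <- rowKey(B)
  onlyA <- setdiff(keyA, keyB)   # keys present in A but not in B
  onlyB <- setdiff(keyB, keyA)   # keys present in B but not in A
  rbind(A[match(onlyA, keyA), , drop = FALSE],
        B[match(onlyB, keyB), , drop = FALSE])
}

# Invented example data.
A <- data.frame(Product = "Cocoa", Price = c(2400, 2440), Nbr.Lots = c(2, 5),
                stringsAsFactors = FALSE)
B <- data.frame(Product = "Cocoa", Price = c(2400, 2440), Nbr.Lots = c(2, 6),
                stringsAsFactors = FALSE)
print(symdiffDF(A, B))
```

I have not timed this against the rbind()/duplicated() version, so whether it is actually faster on your data is something to measure.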

Hope this helps.

Petr Savicky.


