[R] Compare two dataframes

Sat Dec 18 10:00:18 CET 2010

Hi Mark:

> However, if the dataframe contains non-unique rows (two rows with
> exactly the same values in each column) then the unique function will
> delete one of them and that may not be desirable.

To find out which rows two dataframes have in common without removing
duplicated rows from either of them, it is possible to use sorting.
For example:

  n <- ncol(cars)
  cars1 <- cbind(cars[1:35, ], df="df1")
  cars2 <- cbind(cars[16:50, ], df="df2")
  cars.all <- rbind(cars1, cars2) # all cases together, column "df" indicates origin of each case
  row.names(cars.all) <- seq(nrow(cars.all))
  cars.sorted <- cars.all[do.call(order, cars.all), ]
  # compute an index which is the same for rows that are equal
  # except for the "df" component
  index <- cumsum(1 - duplicated(cars.sorted[, 1:n]))
  # for each index of a unique row, compute the number of occurrences in both dataframes
  out <- table(index, cars.sorted$df)
  out[15:20, ]

  index df1 df2
     15   1   0
     16   1   1
     17   2   2
     18   1   1
     19   1   1
     20   1   1

This shows, for example, that the row with index 17 occurs twice in each
dataframe. These rows can be obtained using

  cars.sorted[index == 17, ]

     speed dist  df
  17    13   34 df1
  18    13   34 df1
  37    13   34 df2
  38    13   34 df2
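
The same table can also be used the other way round, to list the rows
which occur in only one of the two dataframes. The following is only a
sketch under the assumptions of the example above; the names only1,
only2, rows.only1 and rows.only2 are introduced here for illustration.
It rebuilds the objects first so that it runs on its own, and it relies
on the rows of the table being in the same order as the index values
1, 2, 3, ...

```r
# Rebuild the objects from the example above.
n <- ncol(cars)
cars1 <- cbind(cars[1:35, ], df = "df1")
cars2 <- cbind(cars[16:50, ], df = "df2")
cars.all <- rbind(cars1, cars2)
row.names(cars.all) <- seq(nrow(cars.all))
cars.sorted <- cars.all[do.call(order, cars.all), ]
index <- cumsum(!duplicated(cars.sorted[, 1:n]))
out <- table(index, cars.sorted$df)

# Index values of rows which occur in one dataframe but not in the other.
only1 <- which(out[, "df1"] > 0 & out[, "df2"] == 0)
only2 <- which(out[, "df2"] > 0 & out[, "df1"] == 0)

# The corresponding rows; duplicated rows are kept.
rows.only1 <- cars.sorted[index %in% only1, ]
rows.only2 <- cars.sorted[index %in% only2, ]
```

Since duplicated rows are kept, nrow(rows.only1) counts every occurrence
and not only the distinct rows.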