[R] finding duplicates in a data frame

Petr Savicky savicky at cs.cas.cz
Wed Jun 13 14:50:44 CEST 2012


On Wed, Jun 13, 2012 at 03:16:57AM -0700, sathya7priya wrote:
> I have two data frames with 3 columns each. My first data frame is large,
> like the one below:
> "new.col ppm.p. freq.p."
> "1_3_diaminopropane 3.13859 5.67516"
> "1_3_diaminopropane 3.137 6.65388"
> "1_3_diaminopropane 3.13541 8.0142"
> "1_3_diaminopropane 3.13383 9.64184"
> "1_3_diaminopropane 3.12075 298.243"
> "1_3_diaminopropane 3.1152 44.6212"
> "1_3_diaminopropane 3.10528 337.852"
> "1_3_diaminopropane 3.09617 44.1467"
> "1_3_diaminopropane 3.08943 308.2"
> "1_3_diaminopropane 3.0807 7.47272"
> "1_3_diaminopropane 3.07912 5.6996"
[...]
> "2_amino_5_ethyl_1_3_4_thiadiazole 1.15306 116.661"
> "2_amino_5_ethyl_1_3_4_thiadiazole 1.14513 64.8014"
> "2_amino_5_ethyl_1_3_4_thiadiazole 1.13681 45.9263"
> "2_amino_5_ethyl_1_3_4_thiadiazole 1.12848 35.0817"
> "2_amino_5_ethyl_1_3_4_thiadiazole 0.000156828 127.55"
> 
> 
> My second data frame is a query with only a few rows:
> "new.col ppm.p. freq.p."
> "unknown" 7.44687 7.1684
> "unknown" 4.81412 105.11
> I want to compare the second and third columns of both data frames and see
> whether there are any identical values in them.
> My expected answer is that the second data frame is similar to the values of
> 1_amino_1_phenylmethyl_phosphonic_acidpeak in data frame 1.

Hi.

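If you are looking for similar rather than identical values, you can
specify a tolerance and compare rows by their sum-of-squares (squared
Euclidean) distance: a row of the second data frame counts as similar to
a row of the first if the distance between them is at most the tolerance.
Since the second data frame is small, a loop over its rows is adequate.
For example: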

  # example data: column 1 is a label, columns 2 and 3 are numeric
  base <- data.frame(x1=letters[1:5], x2=seq(1, 2, length.out=5), x3=seq(1.2, 1.8, length.out=5))
  observed <- data.frame(x1=letters[6:8], x2=c(1.4, 1.01, 1.27), x3=c(1.6, 1.21, 1.37))

  # choose tolerance (largest Euclidean distance that still counts as similar)
  eps <- 0.05

  # compare each row of observed with all rows of base
  mat1 <- as.matrix(base[, 2:3])
  mat2 <- as.matrix(observed[, 2:3])
  for (i in seq_len(nrow(observed))) {
      # sweep() subtracts row i of mat2 from every row of mat1 (FUN defaults to "-")
      j <- which(rowSums(sweep(mat1, MARGIN=2, mat2[i, ])^2) <= eps^2)
      if (length(j) >= 1) cat("row", i, "is similar to row:", j, "\n")
  }
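
The same comparison can also be wrapped in a function that returns the
matches as a data frame instead of printing them, which makes it easier to
look up the label of each matching row afterwards. This is only a sketch:
the name find_similar and its arguments are made up for illustration and
are not part of any package.

```r
# Hypothetical helper (an assumption, not from any package): for each row of
# `query`, list the rows of `base` whose Euclidean distance over the columns
# `cols` is at most `eps`.
find_similar <- function(base, query, cols, eps) {
  mat1 <- as.matrix(base[, cols, drop = FALSE])
  mat2 <- as.matrix(query[, cols, drop = FALSE])
  hits <- lapply(seq_len(nrow(mat2)), function(i) {
    # squared distance from query row i to every row of base
    d2 <- unname(rowSums(sweep(mat1, MARGIN = 2, STATS = mat2[i, ])^2))
    j <- which(d2 <= eps^2)
    data.frame(query_row = rep(i, length(j)),
               base_row  = j,
               distance  = sqrt(d2[j]))
  })
  do.call(rbind, hits)
}

# same example data as in the loop version
base <- data.frame(x1 = letters[1:5],
                   x2 = seq(1, 2, length.out = 5),
                   x3 = seq(1.2, 1.8, length.out = 5))
observed <- data.frame(x1 = letters[6:8],
                       x2 = c(1.4, 1.01, 1.27),
                       x3 = c(1.6, 1.21, 1.37))

res <- find_similar(base, observed, cols = 2:3, eps = 0.05)
# attach the label (first column) of each matching row of base
res$label <- base$x1[res$base_row]
print(res)
```

With the example data this should report the same pairs as the loop
(observed rows 2 and 3 against base rows 1 and 2). One caveat for data like
that in the question: the ppm and freq columns are on very different scales,
so a single eps is dominated by freq. Dividing each column by a typical
spread before computing distances may be more appropriate.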

Hope this helps.

Petr Savicky.


