[R] Removing duplicated rows within a matrix, with missing data as wildcards

stacey thompson stacey.lee.thompson at gmail.com
Fri Mar 9 15:09:10 CET 2007


Hi H.,

Your response has improved the clarity of my thinking.  Kind thanks.
Also, your use of seq_len() prompted me to update from R version 2.3.1
on this machine.

For your matrix

 > x <- matrix(c(1, NA, 3, NA, 2, 3), ncol=3, byrow=TRUE)
 > x
      [,1] [,2] [,3]
 [1,]    1   NA    3
 [2,]   NA    2    3

I would want to delete either x[1,] or x[2,] but not both.
Practically, your "removeLooseDupRows(x)"

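removeLooseDupRows <- function(x)
{
   if (nrow(x) <= 1)
       return(x)
   # Sort row indices by every column in turn; order() puts NA last.
   ii <- do.call("order",
                 args=lapply(seq_len(ncol(x)),
                             function(col) x[ , col]))
   dup_index <- logical(nrow(x))
   i0 <- -1   # index of the most recent complete (NA-free) row
   for (k in seq_along(ii)) {
       i <- ii[k]
       if (any(is.na(x[i, ]))) {
           if (i0 == -1)
               next
           # Keep row i if it differs from row i0 in any non-NA position.
           if (any(x[i, ] != x[i0, ], na.rm=TRUE))
               next
           dup_index[i] <- TRUE
       } else {
           i0 <- i
       }
   }
   # drop=FALSE keeps the result a matrix even if only one row survives.
   x[!dup_index, , drop=FALSE]
}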

should leave no such ambiguous cases for my data, because each x has a
very large nrow(x) and few NAs.  For example, a row of (1, 2, 3) is
very likely to exist in my data.
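
To illustrate that point with a minimal sketch: the helper below is
hypothetical (dropRowsCoveredByComplete is not H.'s function or part of
any package) and assumes that a row containing NA counts as a duplicate
only when some complete row agrees with it in every non-NA column.  With
(1, 2, 3) present, both partial rows from the small example are dropped:

```r
# Hypothetical helper, NOT the removeLooseDupRows() quoted above.
# Assumption: a row with NA is a duplicate only if some complete
# (NA-free) row matches it in every column where it is not NA.
dropRowsCoveredByComplete <- function(x)
{
    has_na <- apply(x, 1, function(r) any(is.na(r)))
    complete <- x[!has_na, , drop = FALSE]
    covered <- logical(nrow(x))
    for (i in which(has_na)) {
        obs <- !is.na(x[i, ])
        # Compare only the observed columns of row i with each complete row.
        hits <- apply(complete[, obs, drop = FALSE], 1,
                      function(r) all(r == x[i, obs]))
        covered[i] <- any(hits)
    }
    x[!covered, , drop = FALSE]
}

x <- matrix(c(1, NA, 3,
              NA, 2, 3,
              1,  2, 3), ncol = 3, byrow = TRUE)
res <- dropRowsCoveredByComplete(x)
print(res)   # only the complete row (1, 2, 3) should remain
```

It compares every NA-containing row against every complete row, so it
would likely be slow for very large matrices; exact duplicates among the
complete rows are left for unique().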

However, to find the row numbers of any remaining ambiguous matches,
should they exist, take this example:

> x <- matrix(c(1, NA, 3, NA, 2, 3, 1, 3, 2, 2, 1, 3, 1, NA, 2, 2, 1, 3), ncol=3, byrow=TRUE)
> x
     [,1] [,2] [,3]
[1,]    1   NA    3
[2,]   NA    2    3
[3,]    1    3    2
[4,]    2    1    3
[5,]    1   NA    2
[6,]    2    1    3

after your suggested

> removeLooseDupRows(x)
     [,1] [,2] [,3]
[1,]    1   NA    3
[2,]   NA    2    3
[3,]    1    3    2
[4,]    2    1    3
[5,]    2    1    3

> q <- removeLooseDupRows(unique(x))
> q
     [,1] [,2] [,3]
[1,]    1   NA    3
[2,]   NA    2    3
[3,]    1    3    2
[4,]    2    1    3

I could

> # ambiguous matches in matrix form
> apply(q, 1, function(row1) apply(q, 1, function(row2) all(is.na(row1) | is.na(row2) | row1==row2)))

      [,1]  [,2]  [,3]  [,4]
[1,]  TRUE  TRUE FALSE FALSE
[2,]  TRUE  TRUE FALSE FALSE
[3,] FALSE FALSE  TRUE FALSE
[4,] FALSE FALSE FALSE  TRUE

> # indices of ambiguous matches
> m <- which(apply(q, 1, function(row1) apply(q, 1, function(row2) all(is.na(row1) | is.na(row2) | row1==row2))), arr.ind=TRUE)
> m
     row col
[1,]   1   1
[2,]   2   1
[3,]   1   2
[4,]   2   2
[5,]   3   3
[6,]   4   4

> #put in order and omit duplicates
> m2 <- unique(t(apply(m, 1, sort)))
> m2
     [,1] [,2]
[1,]    1    1
[2,]    1    2
[3,]    2    2
[4,]    3    3
[5,]    4    4

> # show the ambiguous matches
> m2[m2[,1]!=m2[,2], , drop=FALSE]
     [,1] [,2]
[1,]    1    2

...and proceed from there.
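
As a rough sketch only (the helper names rowsCompatible and
ambiguousPairs are made up for illustration and come from no package),
the sort and unique steps could probably be collapsed by keeping just
the upper triangle of the match matrix, which discards self-matches and
mirrored pairs in one go:

```r
# Hypothetical helper: TRUE when two rows agree wherever both are non-NA.
rowsCompatible <- function(row1, row2)
{
    all(is.na(row1) | is.na(row2) | row1 == row2)
}

# Hypothetical helper: each ambiguous pair of rows once, as (i, j), i < j.
ambiguousPairs <- function(q)
{
    n <- nrow(q)
    compat <- matrix(FALSE, n, n)
    for (i in seq_len(n))
        for (j in seq_len(n))
            compat[i, j] <- rowsCompatible(q[i, ], q[j, ])
    # The upper triangle excludes the diagonal and the mirrored entries.
    pairs <- which(compat & upper.tri(compat), arr.ind = TRUE)
    colnames(pairs) <- c("i", "j")
    pairs
}

q <- matrix(c(1, NA, 3,
              NA, 2, 3,
              1,  3, 2,
              2,  1, 3), ncol = 3, byrow = TRUE)
p <- ambiguousPairs(q)
print(p)
```

For the q above, the only pair reported should be rows 1 and 2.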

This solution came from another helpful "R-help" respondent to my
poorly defined problem.

Appreciative thanks to everyone for your instructive help.

Cheers,
stacey

-- 
-stacey lee thompson-
Stagiaire post-doctorale
Institut de recherche en biologie végétale
Université de Montréal
4101 Sherbrooke Est
Montréal, Québec H1X 2B2 Canada
stacey.thompson at umontreal.ca


