[R] Removing duplicated rows within a matrix, with missing data as wildcards

Fri Mar 9 10:18:30 CET 2007

Hi again,

Your problem as you formulated it is not clearly defined.
For example, what do you want to do with this matrix:

  > x <- matrix(c(1, NA, 3, NA, 2, 3), ncol=3, byrow=TRUE)
  > x
       [,1] [,2] [,3]
  [1,]    1   NA    3
  [2,]   NA    2    3

Remove row 1, row 2 or nothing?

Maybe you want to proceed in 2 steps:
  (1) remove strictly duplicated rows
  (2) remove rows with at least 1 NA that match a row with no NAs

In this case you would not remove any row from x.
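
To make the ambiguity concrete, here is a small sketch in base R only
(complete.cases() comes from the stats package, which is attached by
default). It checks that the two rows agree wherever both are known and
that neither row is free of NAs, which is why step (2) leaves x alone:

```r
x <- matrix(c(1, NA, 3, NA, 2, 3), ncol=3, byrow=TRUE)

# Wildcard comparison: positions where either row has an NA are ignored.
loose_match <- all(x[1, ] == x[2, ], na.rm=TRUE)
loose_match          # TRUE: the rows are compatible with each other

# Step (2) only compares against rows with no NA at all; here there are none.
complete.cases(x)    # FALSE FALSE

# Step (1) does nothing either: the rows are not strictly identical.
nrow(unique(x))      # 2
```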

The removeLooseDupRows() function below does (2) only. If you
want (1) and (2), you need to combine it with unique() by doing
either removeLooseDupRows(unique(x)) or unique(removeLooseDupRows(x))
(both should always give the same result).

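removeLooseDupRows <- function(x)
{
    if (nrow(x) <= 1)
        return(x)
    # Sort row indices by column 1, then column 2, etc. (NAs sort last).
    ii <- do.call("order",
                  args=lapply(seq_len(ncol(x)),
                              function(col) x[ , col]))
    dup_index <- logical(nrow(x))
    i0 <- -1  # index of the last NA-free row seen in sorted order
    for (k in seq_along(ii)) {
        i <- ii[k]
        if (any(is.na(x[i, ]))) {
            # No NA-free row seen yet: nothing to compare against.
            if (i0 == -1)
                next
            # Keep the row if it differs from row i0 in any non-NA position.
            if (any(x[i, ] != x[i0, ], na.rm=TRUE))
                next
            dup_index[i] <- TRUE
        } else {
            i0 <- i
        }
    }
    # drop=FALSE keeps the result a matrix even if only one row remains.
    x[!dup_index, , drop=FALSE]
}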

  > x <- matrix((1:3), 5, 3)
  > x[4,2] = NA
  > x[3,3] = NA
  > x
       [,1] [,2] [,3]
  [1,]    1    3    2
  [2,]    2    1    3
  [3,]    3    2   NA
  [4,]    1   NA    2
  [5,]    2    1    3

  > removeLooseDupRows(x)
       [,1] [,2] [,3]
  [1,]    1    3    2
  [2,]    2    1    3
  [3,]    3    2   NA
  [4,]    2    1    3

  > removeLooseDupRows(unique(x))
       [,1] [,2] [,3]
  [1,]    1    3    2
  [2,]    2    1    3
  [3,]    3    2   NA
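
One caveat: as far as I can tell, removeLooseDupRows() compares each row
containing NAs only with the NA-free row that most recently precedes it
in the sort order, so it can miss a match when another NA-free row sorts
in between. Below is a brute-force sketch of step (2) that compares
every row containing NAs with every NA-free row. The name
removeLooseDupRows2 is made up for this illustration, and the approach
is quadratic in the number of rows, so treat it as a starting point
rather than a tuned solution:

```r
removeLooseDupRows2 <- function(x)
{
    has_na <- !complete.cases(x)   # rows containing at least one NA
    full <- which(!has_na)         # indices of the NA-free rows
    dup_index <- logical(nrow(x))
    for (i in which(has_na)) {
        known <- !is.na(x[i, ])    # columns where row i is not NA
        for (j in full) {
            # Row i is a loose duplicate if it agrees with some
            # NA-free row on all of its known columns.
            if (all(x[i, known] == x[j, known])) {
                dup_index[i] <- TRUE
                break
            }
        }
    }
    x[!dup_index, , drop=FALSE]
}

# A case where an NA-free row sorts between the match and the NA row.
y <- rbind(c(1, 3, 2), c(1, 5, 9), c(1, NA, 2))
nrow(removeLooseDupRows2(y))   # 2: row 3 matches row 1 and is dropped

# The example from the original question.
x <- matrix((1:3), 5, 3)
x[4,2] <- NA
x[3,3] <- NA
res <- removeLooseDupRows2(unique(x))
nrow(res)                      # 3
```

Note that a row consisting only of NAs matches any NA-free row under
this definition, because it has no known column that could disagree.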

Cheers,
H.

Quoting hpages at fhcrc.org:

> Quoting Petr Pikal <petr.pikal at precheza.cz>:
> 
> > Hi
> > 
> > It's a bit tricky, but
> > 
> > dup<-apply(x, 2, duplicated) # which are duplicated
> > isna<-apply(x, 2, is.na) # which are NA
> > check<-dup|isna # which are either
> > 
> > and here is your result
> > 
> > x[rowSums(check)!=3,]
> >      [,1] [,2] [,3]
> > [1,]    1    3    2
> > [2,]    2    1    3
> > [3,]    3    2   NA
> 
> Hi,
> 
> The above doesn't work. No need to have NAs in x:
> 
>   > x <- matrix(c(2,2,1,3,2,3), ncol=2, byrow=TRUE)
>   > x
>        [,1] [,2]
>   [1,]    2    2
>   [2,]    1    3
>   [3,]    2    3
> 
>   > dup <- apply(x, 2, duplicated)
>   > x[rowSums(dup)!=2 ,]
>        [,1] [,2]
>   [1,]    2    2
>   [2,]    1    3
> 
> Look at 'dup':
> 
>   > dup
>         [,1]  [,2]
>   [1,] FALSE FALSE
>   [2,] FALSE FALSE
>   [3,]  TRUE  TRUE
> 
> Yes, each element in the last row is a duplicate in its own col,
> but this doesn't mean that the row as a whole is a duplicate.
> 
> Cheers,
> H.
> 
> 
> > 
> > 
> > Regards
> > Petr
> > 
> > 
> > 
> > 
> > On 8 Mar 2007 at 10:14, stacey thompson wrote:
> > 
> > Date sent:      	Thu, 8 Mar 2007 10:14:37 -0500
> > From:           	"stacey thompson" <stacey.lee.thompson at gmail.com>
> > To:             	r-help at stat.math.ethz.ch
> > Subject:        	[R] Removing duplicated rows within a matrix,
> > 	with missing data as wildcards
> > 
> > > I'd like to remove duplicated rows within a matrix, with missing data
> > > being treated as wildcards.
> > > 
> > > For example
> > > 
> > > > x <- matrix((1:3), 5, 3)
> > > > x[4,2] = NA
> > > > x[3,3] = NA
> > > > x
> > > 
> > >      [,1] [,2] [,3]
> > > [1,]    1    3    2
> > > [2,]    2    1    3
> > > [3,]    3    2   NA
> > > [4,]    1   NA    2
> > > [5,]    2    1    3
> > > 
> > > I would like to obtain
> > > 
> > >       [,1] [,2] [,3]
> > > [1,]    1    3    2
> > > [2,]    2    1    3
> > > [3,]    3    2   NA
> > > 
> > > From the R-help archives, I learned about unique(x) and
> > > duplicated(x).
> > > However, unique(x) returns
> > > 
> > > > unique(x)
> > > 
> > >      [,1] [,2] [,3]
> > > [1,]    1    3    2
> > > [2,]    2    1    3
> > > [3,]    3    2   NA
> > > [4,]    1   NA    2
> > > 
> > > and duplicated(x) gives
> > > 
> > > > duplicated(x)
> > > 
> > > [1] FALSE FALSE FALSE FALSE  TRUE
> > > 
> > > I have tried various na.action settings, but with unique(x) I get
> > > errors at best.
> > > 
> > > e.g.
> > > > unique(x, na.omit(x))
> > > 
> > > Error: argument 'incomparables != FALSE' is not used (yet)
> > > 
> > > How might I tackle this?
> > > 
> > > Thanks,
> > > 
> > > -stacey
> > > 
> > > -- 
> > > -stacey lee thompson-
> > > Stagiaire post-doctorale
> > > Institut de recherche en biologie végétale
> > > Université de Montréal
> > > 4101 Sherbrooke Est
> > > Montréal, Québec H1X 2B2 Canada
> > > stacey.thompson at umontreal.ca
> > > 
> > > ______________________________________________
> > > R-help at stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> > > http://www.R-project.org/posting-guide.html and provide commented,
> > > minimal, self-contained, reproducible code.
> > 
> > Petr Pikal
> > petr.pikal at precheza.cz
> > 
> 