[R] counting row repetitions without loop

Douglas Bates bates at stat.wisc.edu
Wed Feb 6 20:15:48 CET 2008


On Feb 6, 2008 8:08 AM, Waterman, DG (David)
<david.waterman at diamond.ac.uk> wrote:
> Hi,

> I have a data frame consisting of coordinates on a 10*10 grid, i.e.

> > example
>     x  y
> 1   4  5
> 2   6  7
> 3   6  6
> 4   7  5
> 5   5  7
> 6   6  7
> 7   4  5
> 8   6  7
> 9   7  6
> 10  5  6

> What I would like to do is return an 10*10 matrix consisting of counts
> at each position, so in the above example I would have a matrix where,
> for example, cell [4,5] contains 2 and [6,7] contains 3. At the moment I
> have implemented this using a for loop over the rows of the data frame,
> however the data frames I want to process are very long so the loop
> takes many minutes to complete. Can I do this in a more efficient way?

What you are describing is essentially a cross-tabulation so you could use

> examp
   x y
1  4 5
2  6 7
3  6 6
4  7 5
5  5 7
6  6 7
7  4 5
8  6 7
9  7 6
10 5 6
> xtabs(~ x + y, examp)
   y
x   5 6 7
  4 2 0 0
  5 0 1 1
  6 0 1 3
  7 1 1 0

This omits the rows and columns which are completely empty, but you can
work around that by converting 'x' and 'y' to factors with the full set
of levels (1:10 here) before tabulating.
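As a minimal sketch of keeping the empty rows and columns, assuming the
grid runs from 1 to 10 in both directions (the names 'n', 'full' and
'tab' are mine, for illustration only), one could give the factors
their full set of levels before calling xtabs:

```r
## The example data from the question
examp <- data.frame(x = c(4, 6, 6, 7, 5, 6, 4, 6, 7, 5),
                    y = c(5, 7, 6, 5, 7, 7, 5, 7, 6, 6))

n <- 10L  # assumed grid size

## Declare every grid coordinate as a factor level, so that positions
## with no observations are still tabulated (as zero counts)
full <- transform(examp,
                  x = factor(x, levels = seq_len(n)),
                  y = factor(y, levels = seq_len(n)))

tab <- xtabs(~ x + y, full)
dim(tab)     # 10 10
tab[4, 5]    # 2
tab[6, 7]    # 3
```

Because the levels are 1, 2, ..., 10 in that order, position [4, 5] in
the table corresponds to x = 4, y = 5.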

If you have a very large collection of such pairs to summarize you
could consider the version of xtabs in the Matrix package that allows
for the argument sparse = TRUE.  That uses conversion of the "triplet"
form of a sparse matrix to the compressed column form to do the
counting.
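A short sketch of the sparse cross-tabulation follows.  It assumes a
current R installation, where xtabs in the stats package itself accepts
sparse = TRUE and relies on the Matrix package being installed; the
name 'stab' is only illustrative.

```r
library(Matrix)

## The example data from the question
examp <- data.frame(x = c(4, 6, 6, 7, 5, 6, 4, 6, 7, 5),
                    y = c(5, 7, 6, 5, 7, 7, 5, 7, 6, 6))

## Same cross-tabulation as before, but stored as a sparse matrix,
## so only the nonzero counts take up memory
stab <- xtabs(~ x + y, examp, sparse = TRUE)
stab
```

As with the dense result, only the values that actually occur in 'x'
and 'y' appear as rows and columns.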

If you want to do this without converting the integers in 'x' and 'y'
to factors you can use a distinctly unobvious function like

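library(Matrix)
sparsetab <- function(x, y)
{
    x <- as.integer(x)
    y <- as.integer(y)
    stopifnot(length(x) == length(y))
    lx <- length(x)
    mx <- max(x)
    my <- max(y)
    ## triplet form: one entry of 1 per (x, y) pair, 0-based indices;
    ## repeated pairs are summed on conversion to compressed column form
    as(new("dgTMatrix", i = x - 1L, j = y - 1L,
           x = rep(1, lx), Dim = c(mx, my),
           Dimnames = list(as.character(1:mx), as.character(1:my))),
       "dgCMatrix")
}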

which produces

> with(examp, sparsetab(x, y))
7 x 7 sparse Matrix of class "dgCMatrix"
  1 2 3 4 5 6 7
1 . . . . . . .
2 . . . . . . .
3 . . . . . . .
4 . . . . 2 . .
5 . . . . . 1 1
6 . . . . . 1 3
7 . . . . 1 1 .

One reason to use such a function instead of xtabs is that xtabs
will convert 'x' and 'y' to factors.  If the values arrive as character
strings, the default ordering of the levels is lexicographic, so '11'
occurs before '2' (numeric vectors keep their numeric order).  Again,
you can get around that but the function shown above is more direct and
should be fast enough for most any application.
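For the 10*10 grid in the original question, the dimensions can be
fixed in advance rather than taken from max(x) and max(y).  The sketch
below uses sparseMatrix() from the Matrix package, which sums the
values of repeated (i, j) pairs by default; the function name
'gridcounts' and the argument 'n' are hypothetical, chosen only for
this example.

```r
library(Matrix)

## Count occurrences of each (x, y) pair on an n-by-n grid.
## Assumes x and y are positive integers no larger than n.
gridcounts <- function(x, y, n = 10L)
{
    stopifnot(length(x) == length(y))
    ## each pair contributes 1; duplicated pairs are added together
    sparseMatrix(i = as.integer(x), j = as.integer(y),
                 x = rep(1, length(x)), dims = c(n, n))
}

examp <- data.frame(x = c(4, 6, 6, 7, 5, 6, 4, 6, 7, 5),
                    y = c(5, 7, 6, 5, 7, 7, 5, 7, 6, 6))

cnt <- with(examp, gridcounts(x, y))
cnt[4, 5]    # 2
cnt[6, 7]    # 3
```

If an ordinary dense matrix is needed afterwards, as.matrix(cnt)
converts the result.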


