[R] counting the occurrences of vectors
Marc Schwartz
MSchwartz at MedAnalytics.com
Tue Jul 6 01:08:34 CEST 2004
On Sun, 2004-07-04 at 19:28, Spencer Graves wrote:
> I see a case where "f1" gives the wrong answer:
>
> b <- array(c("a:b", "a", "c", "b:c"), dim=c(2,2))
> a <- b[c(1,1),]
>
> For these two matrices, f1(a,b) == c(2,2), while f2(a,b) ==
> c(2,0). If b does not contain ":", e.g., if it is numeric, then this
> pathology can not occur. However, if "f1" is used with objects of class
> character or string that could contain the "collapse" character, it
> could give an incorrect answer without warning.
Greetings,
After seeing Gabor's and Spencer's replies, I of course realized that my
initial reply was not entirely what Ravi was looking for. :-)
However, after seeing Spencer's example above, the thing that I also
noted was the likely overhead involved in paste()ing together the rows
to create objects that could then be tabulated. This is likely to become
more of an issue as the matrix size grows.
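To make the pathology in Spencer's example concrete, here is a minimal sketch. "f1" is not reproduced in this message, so the sketch assumes it builds one key per row with paste(..., collapse = ":"); paste.key is a hypothetical stand-in for that step, not the actual "f1".

```r
# Spencer's matrices from the quoted message.
b <- array(c("a:b", "a", "c", "b:c"), dim = c(2, 2))
a <- b[c(1, 1), ]

# Assumed key construction: collapse each row into one string with ":".
paste.key <- function(m) apply(m, 1, paste, collapse = ":")

keys.b <- paste.key(b)   # both distinct rows of b collapse to "a:b:c"
keys.a <- paste.key(a)

# Count how often each key from b occurs among the keys from a.
counts <- sapply(keys.b, function(k) sum(keys.a == k))
counts
```

Because the two distinct rows of 'b' map to the same key, both get a count of 2, which matches the c(2,2) result Spencer reports.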
It came to me that with a modest modification to my initial function,
combined with Gabor's approach to tabulation, a new function could be
created that avoids the paste()ing overhead:
row.match.count <- function(m1, m2)
{
  if (ncol(m1) != ncol(m2))
    stop("Matrices must have the same number of columns")

  if (typeof(m1) != typeof(m2))
    stop("Matrices must have the same data type")

  m1.l <- as.character(apply(m1, 1, list))
  m2.l <- as.character(apply(m2, 1, list))

  # return counts for each row in m1.l in m2.l
  table(c(unique(m1.l), m2.l))[m1.l] - 1
}
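The last line of the function does the counting in a single expression. As a minimal sketch of how that works, the toy vectors keys.m1 and keys.m2 below are made-up stand-ins for the character keys m1.l and m2.l:

```r
# Hypothetical toy keys standing in for the coerced rows.
keys.m1 <- c("r1", "r2", "r3")
keys.m2 <- c("r1", "r3", "r3", "r9")

# Prepending unique(keys.m1) guarantees that every key from m1 appears
# in the table at least once, so indexing by name always finds an entry.
# Subtracting 1 then removes that artificial occurrence again.
counts <- table(c(unique(keys.m1), keys.m2))[keys.m1] - 1
counts
```

Here "r1" occurs once in keys.m2, "r2" not at all and "r3" twice, so the counts are 1, 0 and 2. The key "r9" is tabulated but dropped by the indexing, since it does not come from keys.m1.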
Using Gabor's original two matrices:
set.seed(1)
a <- matrix(sample(3, 1000, replace = TRUE), ncol = 5)
b <- matrix(sample(3, 100, replace = TRUE), ncol = 5)
We can then count the rows from 'b' in 'a':
> gc(); system.time(ans <- row.match.count(b, a))
         used (Mb) gc trigger (Mb)
Ncells 541226 14.5     741108 19.8
Vcells 141364  1.1     786432  6.0
[1] 0.01 0.00 0.00 0.00 0.00
Now...the downside to this approach is that the actual output of the
function, due to the coercion, is a wee bit ugly (OK, more than a wee
bit...)
For example, using Spencer's two matrices above, we get:
b <- array(c("a:b", "a", "c", "b:c"), dim=c(2,2))
a <- b[c(1,1),]
> row.match.count(b, a)
list(c("a:b", "c")) list(c("a", "b:c"))
                  2                   0
Going back to my two matrices:
> m <- matrix(1:20, ncol = 4, byrow = TRUE)
> n <- matrix(1:40, ncol = 4, byrow = TRUE)
> row.match.count(m, n)
    list(as.integer(c(1, 2, 3, 4)))     list(as.integer(c(5, 6, 7, 8)))
                                  1                                   1
 list(as.integer(c(9, 10, 11, 12))) list(as.integer(c(13, 14, 15, 16)))
                                  1                                   1
list(as.integer(c(17, 18, 19, 20)))
                                  1
So, since we have a few extra CPU cycles to use, we could include some
sub()s to clean up the names in the resultant table:
row.match.count <- function(m1, m2)
{
  if (ncol(m1) != ncol(m2))
    stop("Matrices must have the same number of columns")

  if (typeof(m1) != typeof(m2))
    stop("Matrices must have the same data type")

  m1.l <- as.character(apply(m1, 1, list))
  m2.l <- as.character(apply(m2, 1, list))

  # return counts for each m1.l in m2.l
  match.table <- table(c(unique(m1.l), m2.l))[m1.l] - 1

  # clean up table names
  if (typeof(m1) == "integer")
  {
    names(match.table) <- sub("^list\\(as.integer\\(", "",
                              names(match.table))
    names(match.table) <- sub("\\)\\)$", "", names(match.table))
  }
  else if (typeof(m1) == "character")
  {
    names(match.table) <- sub("^list\\(", "", names(match.table))
    names(match.table) <- sub("\\)$", "", names(match.table))
  }

  match.table
}
Somebody with more regex insight than I could probably clean up the
latter part of the function, but it seems to work well.
That being said, we now get:
> row.match.count(m, n)
    c(1, 2, 3, 4)     c(5, 6, 7, 8)  c(9, 10, 11, 12) c(13, 14, 15, 16)
                1                 1                 1                 1
c(17, 18, 19, 20)
                1
and
> row.match.count(b, a)
c("a:b", "c") c("a", "b:c")
            2             0
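On the regex cleanup in the function above: one possible single-pass alternative is sketched below. It is only an illustration, not a tested replacement. clean.names is a hypothetical helper, and it works on name strings copied from the uncleaned output shown earlier rather than on a live table.

```r
# Example names copied from the uncleaned output earlier in this message.
raw.names <- c('list(as.integer(c(1, 2, 3, 4)))',
               'list(c("a:b", "c"))')

# Hypothetical helper: strip the list( wrapper and the optional
# as.integer( wrapper in one pass, keeping only the inner c(...) call.
# perl = TRUE is used so that the lazy quantifier .*? is handled by PCRE.
clean.names <- function(x)
  sub("^list\\((as\\.integer\\()?(c\\(.*?\\))\\)+$", "\\2", x, perl = TRUE)

clean.names(raw.names)
```

For these two inputs the result is c(1, 2, 3, 4) and c("a:b", "c"). Names that do not match the pattern are returned unchanged, so other data types would keep their uncleaned labels.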
Going back to Gabor's original two matrices, the addition of the name
cleanup does not seem to add much overhead:
set.seed(1)
a <- matrix(sample(3, 1000, replace = TRUE), ncol = 5)
b <- matrix(sample(3, 100, replace = TRUE), ncol = 5)
> gc(); system.time(ans <- row.match.count(b, a))
         used (Mb) gc trigger (Mb)
Ncells 541243 14.5     818163 21.9
Vcells 140464  1.1     786432  6.0
[1] 0.01 0.00 0.01 0.00 0.00
> ans
c(2, 1, 1, 1, 2) c(3, 3, 1, 3, 2) c(2, 1, 2, 3, 2) c(3, 3, 2, 1, 1)
               1                1                3                1
c(1, 1, 1, 2, 3) c(1, 3, 2, 3, 3) c(2, 2, 2, 1, 2) c(2, 1, 1, 1, 1)
               2                0                0                0
c(3, 2, 2, 3, 3) c(2, 3, 3, 2, 2) c(3, 2, 1, 1, 2) c(2, 2, 2, 1, 3)
               2                1                0                2
c(1, 2, 2, 2, 1) c(3, 3, 3, 2, 1) c(2, 2, 3, 3, 3) c(3, 1, 1, 2, 3)
               1                0                3                1
c(3, 2, 3, 3, 1) c(1, 2, 2, 1, 2) c(1, 3, 2, 2, 2) c(1, 1, 1, 2, 3)
               0                1                0                2
I'd be curious to get any feedback on this, and to hear whether anyone
sees any gotchas with this approach.
Thanks and I hope that this is of some help.
Marc Schwartz