[R] create a pairwise co-occurrence matrix

Thu Nov 11 07:39:54 CET 2010

Hi Tax,

I played around with several different functions.  I keep thinking
that there should be an easier/faster way, but this is what I came up
with.  Based on the timings below, foo4 looks like the fastest option
(all five functions give identical results).

#### The functions ####

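foo1 <- function(object) {
  object <- object == 1                # convert 0/1 to logical
  x <- ncol(object)
  vals <- expand.grid(1:x, 1:x)        # every ordered pair of column indices
  ## AND the two columns of each pair, then count the rows where both are TRUE
  y <- colSums(object[, vals[, 1]] & object[, vals[, 2]])
  output <- matrix(y, ncol = x)        # reshape the x^2 counts into x by x
  return(output)
}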

foo2 <- function(object) {
  x <- ncol(object)
  output <- matrix(0, nrow = x, ncol = x)
  for (i in 1:x) {
    for (j in 1:x) {
      output[i, j] <- sum(object[, i] & object[, j])
    }
  }
  return(output)
}

foo3 <- function(object) {
  object <- object == 1
  x <- ncol(object)
  output <- matrix(0, nrow = x, ncol = x)
  for (i in 1:x) {
    for (j in 1:x) {
      output[i, j] <- sum(object[, i] & object[, j])
    }
  }
  return(output)
}

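foo4 <- function(object) {
  object <- object == 1                # convert 0/1 to logical
  ## For each column x, AND it with every column and count the matching rows
  output <- sapply(1:ncol(object), function(x) {
    colSums(object[, x] & object)
  })
  return(output)
}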

foo5 <- function(object) {
  output <- sapply(1:ncol(object), function(x) {
    colSums(object[, x] & object)
  })
  return(output)
}

#### The test data ####
set.seed(2213)
dat <- sample(0:1, 10000, replace = TRUE)
test1 <- matrix(dat, ncol = 10)
test2 <- matrix(dat, ncol = 100)
test3 <- matrix(dat, ncol = 1000)
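
As a quick sanity check on the claim that the functions agree, a
comparison along these lines could be run.  This is only a sketch: the
names loop_version and sapply_version are hypothetical stand-ins for
the loop-based and sapply-based functions above, and the small 20 x 10
matrix is an assumption chosen to keep the check fast.

```r
## Hypothetical check, not part of the original message: compare a
## double-loop version against the sapply version on a small matrix.
set.seed(2213)
m <- matrix(sample(0:1, 200, replace = TRUE), ncol = 10)

loop_version <- function(object) {
  object <- object == 1
  x <- ncol(object)
  output <- matrix(0, nrow = x, ncol = x)
  for (i in seq_len(x)) {
    for (j in seq_len(x)) {
      output[i, j] <- sum(object[, i] & object[, j])
    }
  }
  output
}

sapply_version <- function(object) {
  object <- object == 1
  sapply(seq_len(ncol(object)), function(x) colSums(object[, x] & object))
}

## all.equal() ignores the integer/double storage difference
agree <- all.equal(loop_version(m), sapply_version(m),
                   check.attributes = FALSE)
print(agree)
```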

#### Results ####
       10 cols  100 cols  1000 cols
foo1     0.012     0.586      2.336
foo2     0.013     0.338     20.285
foo3     0.014     0.313     19.550
foo4     0.007     0.065      0.689
foo5     0.008     0.070      0.731
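
Timings like these could be collected roughly as follows.  This is a
sketch under assumptions: the timing code is not shown in this message,
so the use of system.time(), the "elapsed" component, and a single run
per matrix are all guesses at a reasonable setup, and the numbers will
differ from machine to machine.

```r
## Sketch only: one possible way to time a function on the three
## test matrices.  system.time() and a single run are assumptions.
foo4 <- function(object) {
  object <- object == 1
  sapply(seq_len(ncol(object)), function(x) colSums(object[, x] & object))
}

set.seed(2213)
dat <- sample(0:1, 10000, replace = TRUE)
tests <- list(
  "10 cols"   = matrix(dat, ncol = 10),
  "100 cols"  = matrix(dat, ncol = 100),
  "1000 cols" = matrix(dat, ncol = 1000)
)

## Elapsed seconds for one run on each test matrix
timings <- sapply(tests, function(m) system.time(foo4(m))[["elapsed"]])
print(timings)
```

A single run is noisy for the smaller matrices; repeating each call
several times and averaging would give steadier numbers.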

Notice that when I used the same data but varied the number of
columns, some functions were affected far more than others: the nested
loops in foo2 and foo3 slow down sharply at 1000 columns, while foo4
and foo5 scale well.  If the small gain of foo4 over foo5 is not
important to you, I would suggest this for simplicity (essentially the
body of foo5 without the function wrapper).

sapply(1:ncol(object), function(x) {colSums(object[, x] & object)})

For example, using your data:

## This data was read in from your email and then conveniently
## provided using dput(tmp) from my system
tmp <- structure(list(t1 = c(1L, 1L, 1L), t2 = c(1L, 1L, 0L), t3 = c(0L,
0L, 0L), t4 = c(0L, 1L, 0L), t5 = c(1L, 1L, 1L)), .Names = c("t1",
"t2", "t3", "t4", "t5"), class = "data.frame", row.names = c("d1",
"d2", "d3"))

sapply(1:ncol(tmp), function(x) {colSums(tmp[, x] & tmp)})

   [,1] [,2] [,3] [,4] [,5]
t1    3    2    0    1    3
t2    2    2    0    1    2
t3    0    0    0    0    0
t4    1    1    0    1    1
t5    3    2    0    1    3
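
Note that the diagonal here holds the number of documents containing
each term, whereas the matrix you asked for has zeros there (apart from
the t5/t5 entry, which may be a typo).  As an alternative that is not
among the functions timed above, the same counts can be obtained for
0/1 data with a matrix cross-product, and the diagonal can then be set
to zero.  A minimal sketch, assuming the data contain only 0 and 1 (the
name cooc is just illustrative):

```r
## Same example data, rebuilt with data.frame() for readability
tmp <- data.frame(
  t1 = c(1L, 1L, 1L), t2 = c(1L, 1L, 0L), t3 = c(0L, 0L, 0L),
  t4 = c(0L, 1L, 0L), t5 = c(1L, 1L, 1L),
  row.names = c("d1", "d2", "d3")
)

## For a 0/1 matrix X, t(X) %*% X counts the rows where both columns are 1
cooc <- crossprod(as.matrix(tmp))

## Drop the self co-occurrences (assumes a zero diagonal is what is wanted)
diag(cooc) <- 0
print(cooc)
```

This version also keeps the term names on both rows and columns.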

Cheers,

Josh

On Wed, Nov 10, 2010 at 5:03 PM, tax botsis <taxbotsis at gmail.com> wrote:
> Thanks Josh,
>
> here is the table with binary values showing the presence or absence of a
> certain term (t1, t2, t3, t4, and t5) in a document (d1, d2 and d3):
>
>      [t1] [t2] [t3] [t4] [t5]
> [d1]    1    1    0    0    1
> [d2]    1    1    0    1    1
> [d3]    1    0    0    0    1
>
>
> and here is the (adjacency) matrix I would like to get, which counts the
> co-occurrences of each pair of terms across all documents:
>
>      [t1] [t2] [t3] [t4] [t5]
> [t1]    0    2    0    1    3
> [t2]    2    0    0    1    2
> [t3]    0    0    0    0    0
> [t4]    1    1    0    0    1
> [t5]    3    2    0    1    1
>
> Please let me know whether this reads better or whether I should post it
> again
>
> Thanks
> Tax
>
> 2010/11/10 Joshua Wiley <jwiley.psych at gmail.com>
>>
>> Hi Tax,
>>
>> Because the list does not accept HTML messages (per posting guide),
>> your message was converted to plain text, and your table is difficult
>> to read.  My suggestion would be to start with:
>>
>> ?table
>> ?xtabs
>>
>> If you make up a minimal example of the data you have and email it to
>> us, we can give more detailed and specific help.  If your data is
>> stored under the name "dat", you can easily provide it to us using
>> the R function dput().  For example:
>>
>> dput(dat)
>>
>> will give you a bunch of output you can simply copy and paste into
>> your next plain text email.
>>
>> Best regards,
>>
>> Josh
>>
>>
>> On Wed, Nov 10, 2010 at 12:01 PM, tax botsis <taxbotsis at gmail.com> wrote:
>> > Hi all,
>> > I am trying to construct a pairwise co-occurrence matrix for certain terms
>> > appearing in a number of documents. For example, I have the following table
>> > with binary values showing the presence or absence of a certain term in a
>> > document:
>> >
>> >     term1 term2 term3 term4 term5 doc1 1 1 0 0 1 doc2 1 1 0 1 1 doc3 1 0
>> > 0
>> > 0 1
>> >
>> > And I want to have a matrix with the number of pairwise co-occurrences.
>> > So, based on the above table, the matrix should be:
>> >
>> >     term1 term2 term3 term4 term5 term1 0 2 0 1 3 term2 2 0 0 1 2 term3
>> > 0 0
>> > 0 0 0
>> >
>> > term4
>> > 1 1 0 0 1
>> >
>> > term5
>> > 3 2 0 1 1
>> > Any ideas on how to do that?
>> >
>> > Thanks
>> > Tax
>> >
>> >        [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>>
>>
>> --
>> Joshua Wiley
>> Ph.D. Student, Health Psychology
>> University of California, Los Angeles
>> http://www.joshuawiley.com/
>
>
