# [R] Pairwise n for large correlation tables?

Christos Hatzis christos at nuverabio.com
Tue Aug 8 04:44:03 CEST 2006

```
Hi,

You can use complete.cases().
It should run faster than the code you suggested.

See the following example:

x <- matrix(runif(30),10,3)

# introduce missing values
x[sample(1:10,3),1] <- NA
x[sample(1:10,3),2] <- NA
x[sample(1:10,3),3] <- NA

cor(x,use="pairwise.complete.obs")

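# n.na[i, j] counts the rows where columns i and j are both non-missing
n <- ncol(x)
n.na <- matrix(0, n, n)
for (i in seq(1, n)) {
  n.na[i, i] <- sum(complete.cases(x[, i]))
  # seq(i + 1, length = n - i) is empty when i == n, so the inner loop is skipped
  for (j in seq(i + 1, length = n - i)) {
    ok <- sum(complete.cases(x[, i], x[, j]))
    n.na[i, j] <- n.na[j, i] <- ok
  }
}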
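
# A loop-free variant may be faster still on large tables. This is only a
# sketch: it assumes the input is a matrix or something as.matrix() can
# coerce, and the name pairwise.count is made up for illustration.
pairwise.count <- function(m) {
  ok <- 1 * (!is.na(as.matrix(m)))  # 1 where a value is observed, 0 where it is NA
  crossprod(ok)                     # t(ok) %*% ok: cell [i, j] counts rows with both i and j observed
}
# e.g. pairwise.count(x) for the matrix x above should match the loop result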

HTH

-Christos

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Adam D. I. Kramer
Sent: Monday, August 07, 2006 10:04 PM
To: r-help at stat.math.ethz.ch
Subject: [R] Pairwise n for large correlation tables?

Hello,

I'm using a very large data set (n > 100,000 for 7 columns), for which I'm
pretty happy dealing with pairwise-deleted correlations to populate my
correlation table. E.g.,

a <- cor(cbind(col1, col2, col3),use="pairwise.complete.obs")

...however, I am interested in the number of cases used to compute each cell
of the correlation table. I am unable to find such a function via google
searches, so I wrote one of my own. This turns out to be highly inefficient
(e.g., it takes much, MUCH longer than the correlations do). Any hints,
regarding other functions to use or ways to make this speedier, would be
much appreciated!

pairwise.n <- function(df=stop("Must provide data frame!")) {
  if (!is.data.frame(df)) {
    df <- as.data.frame(df)
  }
  colNum <- ncol(df)
  result <- matrix(data=NA, nrow=colNum, ncol=colNum,
                   dimnames=list(colnames(df), colnames(df)))
  for (i in 1:colNum) {
    for (j in i:colNum) {
      result[i,j] <- length(df[!is.na(df[i])&!is.na(df[j])])/colNum
    }
  }
  result
}

--