[R] correlation between rows of data.frame

Christos Hatzis christos.hatzis at nuverabio.com
Fri Aug 1 21:53:01 CEST 2008


Eleni,

A way to do this is to group the data first using 'split' and then sapply
the dist function to this list.
The slower step will be the split which took a couple of minutes on my
laptop but sapply should not take more than a minute or so.

size <- 10000
df <- data.frame(
	id=rep(sample(1:100000,size=size),2),
	a=sample(c(NA,rnorm(100,0,1)), size=2*size, rep=TRUE),
	b=sample(c(NA,rnorm(100,0,1)), size=2*size, rep=TRUE),
	c=sample(c(NA,rnorm(100,0,1)), size=2*size, rep=TRUE))

df$id=factor(df$id)

dfp <- split(df, df$id)
sapply(dfp, dist) 

-Christos

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Eleni Rapsomaniki
> Sent: Friday, August 01, 2008 2:45 PM
> To: r-help at r-project.org
> Subject: [R] correlation between rows of data.frame
> 
> Dear R users,
> 
> I need to come up with an efficient method to compute the 
> correlation (or at least, the euclidean distance if that's 
> easier) between specific rows in a data
> frame (46,232 rows,    29 columns). The pairs of rows between 
> which I want to
> find the correlation share a common value in one of the 
> columns. So for example, in the following  
> x=data.frame(id=rep(sample(1:100000,size=10000),2),a=sample(c(
> NA,rnorm(10,0,1)),size=10000,
> replace=T),b=sample(c(NA,rnorm(10,0,1)),size=10000,
> replace=T),c=sample(c(NA,rnorm(10,0,1)),size=10000, replace=T))
> x$id=factor(x$id)
> 
> I would want to compute the correlation between the two rows 
> (for cols a,b,c) that share the same id. Using a for loop and 
> dist() works but takes a long time (>1 hour, my RAM is
> 1Gb):
> p=NULL
>  for(i in levels(x$id)){p[[i]]=dist(x[x$id==i, -1])}
> 
> Is there a more efficient way? I thought about apply/sapply 
> etc but I don't think they'll work for rows and can't think 
> of an intelligent way to make them work!
> The second problem is that I also need to know how many 
> degrees of freedom (ie non missing pairs of values) were used 
> in each correlation. Is there a way to also do this efficiently?
> 
> I hope this makes sense! Thank you all very much in advance!
> 
> Eleni
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
>



More information about the R-help mailing list