[R] find high correlated variables in a big matrix

David Winsemius dwinsemius at comcast.net
Fri May 6 23:32:20 CEST 2016


> On May 6, 2016, at 2:12 PM, Lida Zeighami <lid.zigh at gmail.com> wrote:
> 
> Hi there,
> 
> Is there any way to find highly correlated variables in a big matrix?
> For example, I have a 2000*5000 matrix called data, and I need to find the
> highly correlated variables among the columns. (I need 100 highly
> correlated variables out of the 5000 column variables.)
> 
> I could calculate the correlation matrix and pick the highly correlated
> ones, but my problem is that I can only pick pairs of variables with high
> correlation, and there may be low correlation across the pairs. That is, in
> my 100*100 correlation matrix there are some pairs with low correlation, and
> I couldn't find 100 variables that are all highly correlated with one
> another.
> Would you please let me know if there is any way?

The rcorr function in the Hmisc package returns a list whose first element is a correlation matrix:

> library(Hmisc)

> base <- rnorm(100)

> test <- matrix(base+0.2*rnorm(300), 100)

> rcorr(test)[[1]]
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.9631220 0.9721688
[2,] 0.9631220 1.0000000 0.9666564
[3,] 0.9721688 0.9666564 1.0000000
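
When only the correlations themselves are needed, base R's cor() also treats each column as a variable and returns one row and column per variable. The following is a small sketch under that assumption, not part of the original reply; the seed and the names shared, sim and cm are made up for illustration. For the 5000 columns in the question, the result would be a 5000 x 5000 matrix of doubles, roughly 200 MB.

```r
set.seed(1)                                    # assumed seed, for reproducibility only
shared <- rnorm(100)                           # common signal shared by every column
sim <- matrix(shared + 0.2 * rnorm(300), 100)  # 100 rows, 3 correlated columns

cm <- cor(sim)                                 # Pearson correlations between columns

dim(cm)                                        # one row and one column per variable
isSymmetric(cm)                                # cor(i, j) equals cor(j, i)
all.equal(diag(cm), rep(1, ncol(sim)))         # each variable correlates 1 with itself
```

The symmetry and the unit diagonal are what make the duplicate and self-pair filtering necessary when searching the matrix for large entries.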

You can use which to find the locations meeting a criterion (or two):

> mycorr <- .Last.value

> which(mycorr > 0.97 & mycorr != 1, arr.ind=TRUE)
     row col
[1,]   3   1
[2,]   1   3
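
Because a correlation matrix is symmetric, each qualifying pair is reported twice in the output above, and the `mycorr != 1` condition is there only to drop the diagonal. One possible way to list each pair once is to restrict the search to the upper triangle. This is a minimal sketch using base R's cor() and upper.tri(); the seed, the 0.97 cutoff, and the names shared, sim, cm, keep, hits and pairs_found are illustrative assumptions.

```r
set.seed(42)                                   # assumed seed, for reproducibility only
shared <- rnorm(100)                           # common signal shared by every column
sim <- matrix(shared + 0.2 * rnorm(300), 100)  # 100 rows, 3 correlated columns

cm <- cor(sim)                                 # Pearson correlation matrix (base R)

# Keep only entries strictly above the diagonal, so each pair appears once
# and the diagonal (always 1) is never considered.
keep <- upper.tri(cm) & cm > 0.97
hits <- which(keep, arr.ind = TRUE)

# Attach the correlation value to each (row, col) pair
pairs_found <- data.frame(hits, r = cm[hits])
print(pairs_found)
```

Using upper.tri() rather than comparing against 1 also avoids accidentally discarding an off-diagonal pair whose correlation happens to be exactly 1, for example two duplicated columns.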



-- 

David Winsemius
Alameda, CA, USA


