[R] Correlated Columns in data frame

Nataraj nataraj at biotech2.sastra.edu
Sat May 17 07:40:51 CEST 2008


Dear all,
Sorry to post my query to the list again; I did not get a
reply to my previous mail.
To put it simply: could someone please give me code to find
the columns of a data frame whose correlation coefficients
are below a pre-determined threshold? (For the detailed
query, please see my previous message to this list, pasted
below.)
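
One deterministic way to do this can be sketched as follows. It is
only a minimal sketch under stated assumptions, not the method used
in the code quoted further down: the helper name keep_uncorrelated
and its threshold argument are hypothetical, all columns are assumed
numeric, and of each too-correlated pair the later column is always
the one dropped.

```r
# Hypothetical helper (not from the original post): returns the indices of
# the columns to keep, so that no kept pair has |r| above `threshold`.
keep_uncorrelated <- function(d, threshold = 0.9) {
  stopifnot(is.matrix(d) || is.data.frame(d))
  cormat <- abs(cor(d))
  n <- ncol(d)
  drop <- integer(0)
  for (j in seq_len(n)) {
    if (j %in% drop) next                      # column j was already dropped
    later <- setdiff(seq_len(n), seq_len(j))   # columns to the right of j
    # later columns too strongly correlated with column j (NA-safe via which)
    hits <- later[which(cormat[j, later] > threshold)]
    drop <- union(drop, hits)
  }
  setdiff(seq_len(n), drop)
}

# Small made-up data set: b is an exact linear function of a, c is not.
d <- data.frame(a = 1:10,
                b = 2 * (1:10) + 1,
                c = rep(c(1, -1), 5))
keep <- keep_uncorrelated(d, threshold = 0.9)
d[, keep, drop = FALSE]   # columns a and c remain
```

Because no random numbers are involved, repeated calls on the same
data give the same columns.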

Thanks and regards,
B.Nataraj

Following is my previous message to this list, to which I
did not get any reply.

Dear all,
For removing correlated columns in a data frame, df,
I found code written in R on the page
http://cheminfo.informatics.indiana.edu/~rguha/code/R/ of
Mr. Rajarshi Guha.
The code is
#################
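r2test <- function(d, cutoff=0.8) {
  # cutoff is an R^2 threshold and must lie in (0, 1]
  if (cutoff > 1 || cutoff <= 0) {
    stop("cutoff must satisfy 0 < cutoff <= 1")
  }
  if (!is.matrix(d) && !is.data.frame(d)) {
    stop("Must supply a data.frame or matrix")
  }
  # compare |r| against sqrt(cutoff), i.e. r^2 against cutoff
  r2cut <- sqrt(cutoff)
  cormat <- cor(d)
  # (row, col) index pairs whose absolute correlation exceeds the cutoff
  bad.idx <- which(abs(cormat) > r2cut, arr.ind=TRUE)
  # keep each pair once (lower triangle); this also drops the diagonal
  bad.idx <- bad.idx[bad.idx[,1] > bad.idx[,2], , drop=FALSE]
  # for each pair, pick at random which of the two columns to drop
  drop.idx <- ifelse(runif(nrow(bad.idx)) > .5, bad.idx[,1], bad.idx[,2])
  if (length(drop.idx) == 0) {
    seq_len(ncol(d))
  } else {
    seq_len(ncol(d))[-unique(drop.idx)]
  }
}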
############################################
The problem is that the code returns different output (i.e.
different column numbers) on different calls with the same
data. I cannot see from the code why this happens. I
understand the logic of the code except for the line
********************************************
drop.idx <- ifelse(runif(nrow(bad.idx)) > .5, bad.idx[,1], bad.idx[,2])
****************************************
What does it mean to compare runif(nrow(bad.idx)) with 0.5?
I am looking for help in understanding why the output
differs between function calls, and what the line quoted
above means.
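
For illustration, the behaviour of that line can be reproduced in
isolation. runif(n) draws n uniform random numbers between 0 and 1,
so runif(n) > .5 acts as one coin flip per row of bad.idx, and
ifelse() then takes the first or the second column index of each pair
accordingly. The matrix below is a made-up stand-in for bad.idx and
pick_random is a hypothetical wrapper, so this is only a sketch of the
mechanism.

```r
# Made-up stand-in for bad.idx: each row is one pair of correlated columns.
bad.idx <- matrix(c(2, 1,
                    5, 3,
                    6, 4), ncol = 2, byrow = TRUE)

# Hypothetical wrapper around the line in question.
pick_random <- function(pairs) {
  coin <- runif(nrow(pairs)) > 0.5     # one TRUE/FALSE coin flip per pair
  ifelse(coin, pairs[, 1], pairs[, 2]) # TRUE: first column, FALSE: second
}

# Without a fixed seed, two calls may pick different columns to drop.
pick_random(bad.idx)
pick_random(bad.idx)

# Fixing the seed before each call makes the random draws repeat.
set.seed(42); first  <- pick_random(bad.idx)
set.seed(42); second <- pick_random(bad.idx)
identical(first, second)   # TRUE
```

Since the draws change from call to call, the set of dropped columns
can change as well; calling set.seed() with a fixed value before the
function is one way to get the same result each time.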

Thanks!
B.Nataraj


