[R] Remove highly correlated variables from a data frame or matrix

Jim Lemon drj|m|emon @end|ng |rom gm@||@com
Thu Nov 14 22:18:28 CET 2019


Hi Ana,
Rather than addressing the question of why you want to do this, let's
make the question easier to answer:

calc.rho<-matrix(c(0.903,0.268,0.327,0.327,0.327,0.582,
0.928,0.276,0.336,0.336,0.336,0.598,
0.975,0.309,0.371,0.371,0.371,0.638,
0.975,0.309,0.371,0.371,0.371,0.638,
0.975,0.309,0.371,0.371,0.371,0.638,
0.975,0.309,0.371,0.371,0.371,0.638),ncol=6,byrow=TRUE)
rnames<-c("rs56192520","rs3764410","rs145984817","rs1807401",
"rs1807402","rs35350506")
rownames(calc.rho)<-rnames
cnames<-c("rs9900318","rs8069906","rs9908521","rs9908336",
"rs9908870","rs9895995")
colnames(calc.rho)<-cnames

Now if you just want a vector of the values less than 0.8, it's trivial:

calc.rho[calc.rho<0.8]

However, based on your previous questions, I suspect you want
something else. Maybe the pairs of row/column names that correspond to
the values less than 0.8. To ensure that you haven't tricked us by not
including columns in which values range around 0.8, I'll do it this
way:

# make the new variable name possible to decode
calc.lt.8<-calc.rho<0.8
varnames.lt.8<-data.frame(var1=NA,var2=NA)
for(row in 1:nrow(calc.rho)) {
 for(col in 1:ncol(calc.rho))
  if(calc.lt.8[row,col])
   varnames.lt.8<-rbind(varnames.lt.8,c(rnames[row],cnames[col]))
}
# now get rid of the first row of NA values
varnames.lt.8<-varnames.lt.8[-1,]

Clunky, but effective. You now have those variable pairs that you may
want. Let us know in the next episode of this soap operation.
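
For what it's worth, the same pairs can probably be pulled out without
the loop. What follows is only a minimal sketch, assuming the six-by-six
calc.rho built above; idx and pairs.lt.8 are illustrative names that do
not appear elsewhere in this thread, and the rho column is an extra that
the loop does not produce:

```r
# Rebuild the small example matrix so this block runs on its own
calc.rho <- matrix(c(0.903,0.268,0.327,0.327,0.327,0.582,
                     0.928,0.276,0.336,0.336,0.336,0.598,
                     0.975,0.309,0.371,0.371,0.371,0.638,
                     0.975,0.309,0.371,0.371,0.371,0.638,
                     0.975,0.309,0.371,0.371,0.371,0.638,
                     0.975,0.309,0.371,0.371,0.371,0.638),
                   ncol=6, byrow=TRUE)
rownames(calc.rho) <- c("rs56192520","rs3764410","rs145984817",
                        "rs1807401","rs1807402","rs35350506")
colnames(calc.rho) <- c("rs9900318","rs8069906","rs9908521",
                        "rs9908336","rs9908870","rs9895995")

# which(..., arr.ind=TRUE) returns a two-column matrix ("row", "col")
# holding the position of every cell that satisfies the condition
idx <- which(calc.rho < 0.8, arr.ind=TRUE)

# translate the positions into row/column names and keep the value too
pairs.lt.8 <- data.frame(var1=rownames(calc.rho)[idx[,"row"]],
                         var2=colnames(calc.rho)[idx[,"col"]],
                         rho=calc.rho[idx],
                         stringsAsFactors=FALSE)
```

The rows come out in column-major order rather than the row-by-row
order of the loop, so sort afterwards if the order matters.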

Jim

On Fri, Nov 15, 2019 at 5:50 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
>
> Hello,
>
> I have a data frame like this (a matrix):
> head(calc.rho)
>             rs9900318 rs8069906 rs9908521 rs9908336 rs9908870 rs9895995
> rs56192520      0.903     0.268     0.327     0.327     0.327     0.582
> rs3764410       0.928     0.276     0.336     0.336     0.336     0.598
> rs145984817     0.975     0.309     0.371     0.371     0.371     0.638
> rs1807401       0.975     0.309     0.371     0.371     0.371     0.638
> rs1807402       0.975     0.309     0.371     0.371     0.371     0.638
> rs35350506      0.975     0.309     0.371     0.371     0.371     0.638
>
> > dim(calc.rho)
> [1] 246 246
>
> I would like to remove from this data all highly correlated variables,
> with correlation more than 0.8
>
> I tried this:
>
> > data<- calc.rho[,!apply(calc.rho,2,function(x) any(abs(x) > 0.80))]
> > dim(data)
> [1] 246   0
>
> Can you please advise,
>
> Thanks
> Ana
>
> But this removes everything.


