[R] Remove highly correlated variables from a data frame or matrix

Fri Nov 15 19:03:41 CET 2019

HI Peter,

Thank you for getting back to me and shedding light on this. I see
your point, doing Jim's method:

> keeprows<-apply(calc.rho,1,function(x) return(sum(x>0.8)<3))
> ro246.lt.8<-calc.rho[keeprows,keeprows]
> ro246.lt.8[ro246.lt.8 == 1] <- NA
> (mmax <- max(abs(ro246.lt.8), na.rm=TRUE))
[1] 0.566

Which is good in general, correlations in my matrix  should not be
exceeding 0.8. I need to run Mendelian Rendomization on it later on so
I can not be having there highly correlated SNPs. But with Jim's
method I am only left with 17 SNPs (out of 246) and that means that
both pairs of highly correlated SNPs are removed and it would be good
to keep one of those highly correlated ones.

I tried to do your code:
> tree = hclust(1-calc.rho, method = "average")
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor
exceed 65536") :
  missing value where TRUE/FALSE needed

Please advise.

Thanks
Ana

On Thu, Nov 14, 2019 at 7:37 PM Peter Langfelder
<peter.langfelder using gmail.com> wrote:
>
> I suspect that you want to identify which variables are highly
> correlated, and then keep only "representative" variables, i.e.,
> remove redundant ones. This is a bit of a risky procedure but I have
> done such things before as well sometimes to simplify large sets of
> highly related variables. If your threshold of 0.8 is approximate, you
> could simply use average linkage hierarchical clustering with
> dissimilarity = 1-correlation, cut the tree at the appropriate height
> (1-0.8=0.2), and from each cluster keep a single representative (e.g.,
> the one with the highest mean correlation with other members of the
> cluster). Something along these lines (untested)
>
> tree = hclust(1-calc.rho, method = "average")
> clusts = cutree(tree, h = 0.2)
> clustLevels = sort(unique(clusts))
> representatives = unlist(lapply(clustLevels, function(cl)
> {
>   inClust = which(clusts==cl);
>   rho1 = calc.rho[inClust, inClust, drop = FALSE];
>   repr = inClust[ which.max(colSums(rho1)) ]
>   repr
> }))
>
> the variable representatives now contains indices of the variables you
> want to retain, so you could subset the calc.rho matrix as
> rho.retained = calc.rho[representatives, representatives]
>
> I haven't tested the code and it may contain bugs, but something along
> these lines should get you where you want to be.
>
> Oh, and depending on how strict you want to be with the remaining
> correlations, you could use complete linkage clustering (will retain
> more variables, some correlations will be above 0.8).
>
> Peter
>
> On Thu, Nov 14, 2019 at 10:50 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
> >
> > Hello,
> >
> > I have a data frame like this (a matrix):
> > head(calc.rho)
> >             rs9900318 rs8069906 rs9908521 rs9908336 rs9908870 rs9895995
> > rs56192520      0.903     0.268     0.327     0.327     0.327     0.582
> > rs3764410       0.928     0.276     0.336     0.336     0.336     0.598
> > rs145984817     0.975     0.309     0.371     0.371     0.371     0.638
> > rs1807401       0.975     0.309     0.371     0.371     0.371     0.638
> > rs1807402       0.975     0.309     0.371     0.371     0.371     0.638
> > rs35350506      0.975     0.309     0.371     0.371     0.371     0.638
> >
> > > dim(calc.rho)
> > [1] 246 246
> >
> > I would like to remove from this data all highly correlated variables,
> > with correlation more than 0.8
> >
> > I tried this:
> >
> > > data<- calc.rho[,!apply(calc.rho,2,function(x) any(abs(x) > 0.80))]
> > > dim(data)
> > [1] 246   0
> >
> > Can you please advise,
> >
> > Thanks
> > Ana
> >
> > But this removes everything.
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.