[R] Predicting correlated responses using canonical correlations in multivariate least squares

Sat Feb 27 10:38:14 CET 2016

I have a considerable interest in trying to improve the predictions from 
10+ highly correlated predictors to two (2) highly correlated responses.

OLS does not allow me to take into account the correlation between the 
responses. Multivariate can take into account the correlations between 
the predictors.

I have tried to use canonical correlations (R function cancor). 
Canonical correlations nicely show the interrelations between the 
responses and predictors. Predicting the responses using cancor is a 
different matter. There is no predict method that supports cancor. 
Simple backsubstition leads to improved correlations but poor pairwise 
correspondence. (R code I use is below, for a 2 predictor, 2 response 
set of data, this is a book example (Frets, 1921 headsize data), my data 
is similar but can't be summarized in this post as easily)

The 1997 paper by Breiman and Friedman gives a procedure using shrinkage 
on least squares estimates obtained from canonical variates and their 
origins (the responses). They call this Curds and Whey. See  this post.

ftp://ftp.cis.upenn.edu/pub/datamining/public_html/ReadingGroup/papers/multiResponse.pdf

There is a master thesis (Kidd) which uses the above and promises an R 
package to handle the curds and whey procedures. That R package does not 
seem to have been made public though. See here:

http://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=3183&context=etd

Since the Breimam and Friedman paper is almost 20 years old I am 
wondering about their procedure. One would think that it would have been 
adopted if it is as powerful as it is claimed to be since correlated 
dependents are quite common. Is there a flaw in it that I do not know 
of? Is there anything better available today?

Any tips are most welcome.

Regards,
Alex van der Spek

#! usr/env/bin R
#CCA example try out
#Use data on head size of sons/brothers length and breadth
#Taken from the book by Everitt

#Clean slate
rm(list=ls())

## Panel function, put histograms on the diagonal
panel.hist <- function(x, ...)
{
     usr <- par("usr"); on.exit(par(usr))
     par(usr = c(usr[1:2], 0, 1.5) )
     h <- hist(x, plot = FALSE, breaks=21)
     breaks <- h$breaks; nB <- length(breaks)
     y <- h$counts; y <- y/max(y)
     rect(breaks[-nB], 0, breaks[-1], y, col="white", ...)
}

## Panel function, put correlation in top
panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
     usr <- par("usr"); on.exit(par(usr))
     par(usr = c(0, 1, 0, 1))
     r <- abs(cor(x, y))
     txt <- format(c(r, 0.123456789), digits=digits)[1]
     txt <- paste(prefix, txt, sep="")
     if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
     text(0.5, 0.5, txt, cex = cex.cor * r)
}

#Bring up file selection box, set working directory and get base name
fn <- file.choose()
fs <- basename(fn)
fp <- dirname(fn)

#Read data in by csv
hs.r <- read.csv(fn)

#This is the entire data set, 25 points, two predictors (x) and two 
responses (y)
> head(hs.r, 25)
     x1  x2  y1  y2
1  191 155 179 145
2  195 149 201 152
3  181 148 185 149
4  183 153 188 149
5  176 144 171 142
6  208 157 192 152
7  189 150 190 149
8  197 159 189 152
9  188 152 197 159
10 192 150 187 151
11 179 158 186 148
12 183 147 174 147
13 174 150 185 152
14 190 159 195 157
15 188 151 187 158
16 163 137 161 130
17 195 155 183 158
18 186 153 173 148
19 181 145 182 146
20 175 140 165 137
21 192 154 185 152
22 174 143 178 147
23 176 139 176 143
24 197 167 200 158
25 190 163 187 150

#Make a scatter plot matrix
pairs(hs.r ,lower.panel = panel.smooth, upper.panel = panel.cor, 
diag.panel = panel.hist)

#Sweep out summary stats
hs.s <- scale(hs.r, center=F, scale=T)
hs.c <- scale(hs.r, center=T, scale=F)
hs.a <- scale(hs.r, center=T, scale=T)

#Make a scatter plot matrix of centered and scaled, standardized data
pairs(hs.a ,lower.panel = panel.smooth, upper.panel = panel.cor, 
diag.panel = panel.hist)
hs.1 <- hs.a[,1:2]
hs.2 <- hs.a[,3:4]

cc <- cancor(hs.1, hs.2, xcenter=F, ycenter=F)

#Compute the decompositon
#u1 <- cc$xcoef[1,1] * hs.1[,'x1'] + cc$xcoef[1,2] * hs.1[,'x2']
#v1 <- cc$ycoef[1,1] * hs.2[,'y1'] + cc$ycoef[1,2] * hs.2[,'y2']
#u2 <- cc$xcoef[2,1] * hs.1[,'x1'] + cc$xcoef[2,2] * hs.1[,'x2']
#v2 <- cc$ycoef[2,1] * hs.2[,'y1'] + cc$ycoef[2,2] * hs.2[,'y2']

#Quicker and better
uu <- hs.1 %*% cc$xcoef
vv <- hs.2 %*% cc$ycoef

#Store in a dataframe
uv <- data.frame(u1=uu[,1], v1=vv[,1], u2=uu[,2], v2=vv[,2])

#Make a scatter plot matrix
pairs(uv ,lower.panel = panel.smooth, upper.panel = panel.cor, 
diag.panel = panel.hist)

#Make a linear model, the significant coefs will prove to be the 
eigenvalues
lm.uv <- lm(cbind(u1, u2) ~ v1 + v2 - 1, data = uv)
summary(lm.uv)

#Notation  x * k * A  == y * B, solve for y
A <- cc$xcoef
B <- cc$ycoef
r <- cc$cor

#Compute the back substitioon
C <-  A %*% diag(r) %*% solve(B)

yy <- hs.1 %*% C
colnames(yy) <- c('yy1', 'yy2')

#Show how this relates to the original responses, not bad
pairs(cbind(hs.1, hs.2, yy), lower.panel = panel.smooth, upper.panel = 
panel.cor, d

#But pairwise the correspondence is poor.
head(cbind(hs.2, yy)

> head(cbind(hs.2, yy))
              y1          y2         yy1          yy2
[1,] -0.4820596 -0.63189808  0.43234091  0.434382250
[2,]  1.7091204  0.41132988  0.30938540  0.259340111
[3,]  0.1155349 -0.03576782 -0.36892523 -0.368659286
[4,]  0.4143322 -0.03576782 -0.02725575 -0.005036034
[5,] -1.2788523 -1.07899577 -0.79475501 -0.798376574
[6,]  0.8127286  0.41132988  1.29559865  1.241261763