[R] Overfitting/Calibration plots (Statistics question)

Frank E Harrell Jr f.harrell at Vanderbilt.Edu
Fri Apr 9 02:15:40 CEST 2010


Mark Seeto wrote:
> This isn't a question about R, but I'm hoping someone will be willing
> to help. I've been looking at calibration plots in multiple regression
> (plotting observed response Y on the vertical axis versus predicted
> response [Y hat] on the horizontal axis).
> 
> According to Frank Harrell's "Regression Modeling Strategies" book
> (pp. 61-63), when making such a plot on new data (having obtained a
> model from other data) we should expect the points to be around a line
> with slope < 1, indicating overfitting. As he writes, "Typically, low
> predictions will be too low and high predictions too high."
> 
> However, when I make these plots, both with real data and with simple
> simulated data, I get the opposite: the points are scattered around a
> line with slope > 1. Low predictions are too high and high predictions
> are too low.
> 
> For example, I generated 200 cases, fitted models on the first half of
> the data, and made calibration plots for those models on the second
> half of the data:
> 
>> x1 <- rnorm(200, 0, 1)
>> x2 <- rnorm(200, 0, 1)
>> x3 <- rnorm(200, 0, 1)
>> x4 <- rnorm(200, 0, 1)
>> x5 <- rnorm(200, 0, 1)
>> x6 <- rnorm(200, 0, 1)
>> y <- x1 + x2 + rnorm(200, 0, 2)
>> d <- data.frame(y, x1, x2, x3, x4, x5, x6)
>>
>> lm1 <- lm(y ~ ., data = d[1:100,])
>> lm2 <- lm(y ~ x1 + x2, data = d[1:100,])
>>
>> plot(predict(lm1, d[101:200, ]), d$y[101:200]); abline(0,1)
>> x11(); plot(predict(lm2, d[101:200, ]), d$y[101:200]); abline(0,1)
> 
> The plots for both lm1 and lm2 show the points scattered around a line
> with slope > 1, contrary to what Frank Harrell says should happen.
> 
> I am either misunderstanding something or making a mistake in the code
> (I'm almost 100% certain I'm not mixing up the axes). I would be most
> appreciative if anyone could explain where I'm going wrong.
> 
> Thanks for any help you can provide.
> 
> Mark Seeto

Mark,

Try

set.seed(1)
slope1 <- slope2 <- numeric(100)

for(i in 1:100) {
  ## simulate 200 observations; only x1 and x2 are related to y
  x1 <- rnorm(200, 0, 1)
  x2 <- rnorm(200, 0, 1)
  x3 <- rnorm(200, 0, 1)
  x4 <- rnorm(200, 0, 1)
  x5 <- rnorm(200, 0, 1)
  x6 <- rnorm(200, 0, 1)
  y <- x1 + x2 + rnorm(200, 0, 2)
  d <- data.frame(y, x1, x2, x3, x4, x5, x6)

  ## fit on the first half: lm1 includes 4 noise predictors, lm2 only the true ones
  lm1 <- lm(y ~ ., data = d[1:100, ])
  lm2 <- lm(y ~ x1 + x2, data = d[1:100, ])

  ## calibration slope on the second half: least-squares slope of
  ## observed y on predicted y
  slope1[i] <- coef(lsfit(predict(lm1, d[101:200, ]), d$y[101:200]))[2]
  slope2[i] <- coef(lsfit(predict(lm2, d[101:200, ]), d$y[101:200]))[2]
}

mean(slope1)
mean(slope2)

I get

[1] 0.8873878
[1] 0.9603158
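
A minimal base-R sketch along the same lines, for a single train/test split
(seed and split chosen arbitrarily here): it overlays the least-squares
calibration line of observed on predicted, which is what lsfit estimates
above, next to the ideal 45-degree line, so the slope can be read from the
fit rather than eyeballed from the point cloud.

set.seed(2)
x1 <- rnorm(200); x2 <- rnorm(200); x3 <- rnorm(200)
x4 <- rnorm(200); x5 <- rnorm(200); x6 <- rnorm(200)
y <- x1 + x2 + rnorm(200, 0, 2)
d <- data.frame(y, x1, x2, x3, x4, x5, x6)

fit  <- lm(y ~ ., data = d[1:100, ])   # model with 4 noise predictors
yhat <- predict(fit, d[101:200, ])     # predictions on the held-out half
yobs <- d$y[101:200]

cal <- lm(yobs ~ yhat)                 # calibration regression: observed on predicted
plot(yhat, yobs, xlab = "Predicted", ylab = "Observed")
abline(0, 1, lty = 2)                  # ideal calibration (slope 1)
abline(cal)                            # estimated calibration line; its slope
                                       # averages < 1 over repeated splits
coef(cal)[2]                           # calibration slope for this split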

Frank





-- 
Frank E Harrell Jr   Professor and Chairman        School of Medicine
                      Department of Biostatistics   Vanderbilt University


