[R] Interesting behavior of lm() with small, problematic data sets

Tue Sep 5 18:28:51 CEST 2017

> On Sep 5, 2017, at 6:24 AM, Glover, Tim <Tim.Glover at amecfw.com> wrote:
> 
> I've recently come across the following results reported by the lm() function when applied to a particular type of admittedly difficult data.  When working with
> small data sets (for instance 3 points) that have the same response for different values of the predictor variable, the resulting slope estimate is a reasonable approximation of the expected 0.0, but the p-value of that slope estimate is surprising.  A reproducible example is included below, along with the output of summary().
> 
> ######### example code
> x <- c(1,2,3)
> y <- c(1,1,1)
> 
> # the above gives the data set {(1,1), (2,1), (3,1)} to regress
> 
> new.rez <- lm(y ~ x) # regress constant y on changing x
> summary(new.rez) # display results of regression
> 
> ######## end of example code
> 
> Results:
> 
> Call:
> lm(formula = y ~ x)
> 
> Residuals:
>         1          2          3
> 5.906e-17 -1.181e-16  5.906e-17
> 
> Coefficients:
>              Estimate Std. Error    t value Pr(>|t|)
> (Intercept)  1.000e+00  2.210e-16  4.525e+15   <2e-16 ***
> x           -1.772e-16  1.023e-16 -1.732e+00    0.333
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> Residual standard error: 1.447e-16 on 1 degrees of freedom
> Multiple R-squared:  0.7794,    Adjusted R-squared:  0.5589
> F-statistic: 3.534 on 1 and 1 DF,  p-value: 0.3112
> 
> Warning message:
> In summary.lm(new.rez) : essentially perfect fit: summary may be unreliable
> 
> 
> ##############
> 
> There is a warning that the summary may be unreliable due to the essentially perfect fit, but a p-value of 0.3112 doesn’t seem reasonable.
> As a side note, the various r^2 values seem odd too.

You have an overfitted model with only 3 perfectly fit-able data points and you are whinging about a Wald statistic about which you were warned. I think you are wasting our time. (But I'm fully retired and I have a lot of time to waste.)

I seem to remember that a t-distribution with 1 degree of freedom is actually the Cauchy distribution. I would point out that you can also get:

> 2*pt(-1.732e+00, 1)
[1] 0.3333414

So from that perspective any value might be "reasonable": with only three data points you have a single residual degree of freedom, and the t-statistic you are looking at is essentially the ratio 0/0 from a numerical point of view.
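
To make the Cauchy remark concrete, here is a minimal sketch in base R. It assumes only the rounded values printed in the summary above (the t value -1.732, the estimate and its standard error); the variable names are mine, chosen for illustration. It checks that the two-sided p-value from a t-distribution with 1 degree of freedom agrees with the standard Cauchy distribution and with the closed form of the Cauchy CDF, and then expresses the reported estimate and standard error in units of machine epsilon.

```r
# t value for x as printed (rounded) in the summary output above
t_stat <- -1.732

# Two-sided p-value from the t-distribution with 1 degree of freedom
p_t <- 2 * pt(-abs(t_stat), df = 1)

# The same quantity from the standard Cauchy distribution
p_cauchy <- 2 * pcauchy(-abs(t_stat))

# Closed form: the Cauchy CDF is 1/2 + atan(x)/pi,
# so the two-sided tail area is 1 - 2*atan(|t|)/pi
p_closed <- 1 - 2 * atan(abs(t_stat)) / pi

print(c(t = p_t, cauchy = p_cauchy, closed_form = p_closed))

# Estimate and standard error from the summary, in units of machine epsilon
eps_ratio <- c(estimate = 1.772e-16, std_error = 1.023e-16) / .Machine$double.eps
print(eps_ratio)
```

All three p-values should agree with the 0.3333414 shown above. Both magnitudes come out below one machine epsilon, which is the sense in which the t value is a ratio of rounding noise rather than a meaningful statistic.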

-- 
David.