[R] Simple lm/regression question

Mon Feb 6 10:57:23 CET 2012

On Mon, 6 Feb 2012, James Annan wrote:

> I am trying to use lm for a simple linear fit with weights. The results 
> I get from IDL (which I am more familiar with) seem correct and 
> intuitive, but the "lm" function in R gives outputs that seem strange to 
> me.
>
> Unweighted case:
>
>> x<-1:4
>> y<-(1:4)^2
>> summary(lm(y~x))
>
> Call:
> lm(formula = y ~ x)
>
> Residuals:
> 1  2  3  4
> 1 -1 -1  1
>
> Coefficients:
>            Estimate Std. Error t value Pr(>|t|)
> (Intercept)  -5.0000     1.7321  -2.887   0.1020
> x             5.0000     0.6325   7.906   0.0156 *
>
> So far, so good - IDL does much the same:
>
> IDL> vec=linfit(x,y,sigma=sig)
> IDL> print,vec,sig
>     -5.00000      5.00000
>      1.73205     0.632456
>
> Now, if the dependent variable has known (measurement) uncertainties (10, 
> say) then it is appropriate to use weights defined as the inverse of the 
> variances, right?

If you think that in y = x'b + e the error has

E(e^2) = sigma^2 * w

then you should use

weights = 1/w

in R because that is how the weights enter the objective function

sum 1/w (y - x'b)^2

>> summary(lm(y~x,weights=c(.01,.01,.01,.01)))
>
> Call:
> lm(formula = y ~ x, weights = c(0.01, 0.01, 0.01, 0.01))
>
> Residuals:
>   1    2    3    4
> 0.1 -0.1 -0.1  0.1
>
> Coefficients:
>            Estimate Std. Error t value Pr(>|t|)
> (Intercept)  -5.0000     1.7321  -2.887   0.1020
> x             5.0000     0.6325   7.906   0.0156 *
>
> But here the *residuals* have changed and are no longer the actual 
> response minus fitted values. (They seem to be scaled by the sqrt of the 
> weights).

The summary() shows under "Residuals" the contributions to the objective 
function, i.e. sqrt(1/w) (y - x'b) in the notation above.

However, by using the residuals extractor function you can get the 
unweighted residuals:

residuals(lm(y~x,weights=c(.01,.01,.01,.01)))

> The uncertainties on the parameter estimates, however, have 
> *not* changed, which seems very odd to me.

lm() interprets the weights as precision weights, not as case weights.

Thus, the scaling in the variances is done by the number of (non-zero) 
weights, not by the sum of weights.

> The behaviour of IDL is rather different and intuitive to me:
>
> IDL> vec=linfit(x,y,sigma=sig,measure_errors=[1,1,1,1])
> IDL> print,vec,sig
>     -5.00000      5.00000
>      1.22474     0.447214
>
> IDL> vec=linfit(x,y,sigma=sig,measure_errors=[10,10,10,10])
> IDL> print,vec,sig
>     -5.00000      5.00000
>      12.2474      4.47214

This appears to use sandwich standard errors. If you load the sandwich and 
lmtest package you can do

coeftest(m, vcov = sandwich)

where 'm' is the fitted lm object.

Hope that helps,
Z

> Note that the uncertainties are 10* larger when the errors are 10* larger.
>
> My question is really how to replicate these results in R. I suspect there is 
> some terminology or convention that I'm confused by. But in the lm help page 
> documentation, even the straightforward definition of "residual" seems 
> incompatible with what the code actually returns:
>
> 	residuals: the residuals, that is response minus fitted values.
>
> But, in the weighted case, this is not actually true...
>
> James
> -- 
> James D Annan jdannan at jamstec.go.jp Tel: +81-45-778-5618 (Fax 5707)
> Senior Scientist, Research Institute for Global Change, JAMSTEC
> Yokohama Institute for Earth Sciences, 3173-25 Showamachi,
> Kanazawa-ku, Yokohama City, Kanagawa, 236-0001 Japan
> http://www.jamstec.go.jp/frcgc/research/d5/jdannan/
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>