# [R] Weighted least squares

Tue May 8 14:16:50 CEST 2007

```
See below.

> Dear all,
>
> I'm struggling with weighted least squares, where something that I had
> assumed to be true appears not to be the case.  Take the following
> data set as an example:
>
> df <- data.frame(x = runif(100, 0, 100))
> df$y <- df$x + 1 + rnorm(100, sd=15)
>
>
> summary(lm(y ~ x, data=df, weights=rep(2, 100)))
> summary(lm(y ~ x, data=rbind(df,df)))

You assign weights to different points according to some external
measure of quality or reliability, not according to the number of times
a data point was measured.

Look at the estimates and standard errors of the two models below:

coefficients( summary(f.w <- lm(y ~ x, data=df, weights=rep(2, 100))) )
            Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 1.940765 3.30348066  0.587491 5.582252e-01
x           0.982610 0.05893262 16.673448 2.264258e-30

coefficients( summary( f.u <- lm(y ~ x, data=rbind(df,df) ) ) )
            Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 1.940765 2.32408609  0.8350659 4.046871e-01
x           0.982610 0.04146066 23.6998165 1.012067e-59

You can see that they have the same coefficient estimates, but the second
one has smaller standard errors.

Repeating the values artificially deflates the variance and thus inflates
the apparent precision. This is why you cannot treat replicate data as
independent observations.
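
# A minimal sketch of where the gap comes from, assuming a fresh simulated
# data set (the seed and the names d, se.w, se.u are illustrative and not
# from the thread). With constant weights the weighted fit keeps 98 residual
# degrees of freedom, while the duplicated fit has 198 and twice the
# cross-product matrix, so the standard errors should differ by a factor
# of sqrt(198/98).
set.seed(1)
d <- data.frame(x = runif(100, 0, 100))
d$y <- d$x + 1 + rnorm(100, sd = 15)
se.w <- coefficients(summary(lm(y ~ x, data = d, weights = rep(2, 100))))[, "Std. Error"]
se.u <- coefficients(summary(lm(y ~ x, data = rbind(d, d))))[, "Std. Error"]
se.w / se.u       # ratio of standard errors, intercept and slope
sqrt(198 / 98)    # about 1.42, the same ratio as in the output above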

> would be equivalent, but they are not.  I suspect the difference is
> how the degrees of freedom is calculated - I had expected it to be
> sum(weights), but seems to be sum(weights > 0).  This seems
> unintuitive to me:
>
> summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
> summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))
>
> What am I missing?  And what is the usual way to do a linear
> regression when you have aggregated data?
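
# A quick check of the degrees-of-freedom observation, as a sketch on
# simulated data (the seed and the names d2, fit0, fit1 are illustrative).
# As far as I can tell, lm() drops zero-weight observations before fitting,
# so the residual degrees of freedom count only the rows with non-zero
# weight, however small that weight is.
set.seed(2)
d2 <- data.frame(x = runif(100, 0, 100))
d2$y <- d2$x + 1 + rnorm(100, sd = 15)
fit0 <- lm(y ~ x, data = d2, weights = rep(c(0, 2), each = 50))
fit1 <- lm(y ~ x, data = d2, weights = rep(c(0.01, 2), each = 50))
df.residual(fit0)   # 50 rows with non-zero weight, minus 2 coefficients
df.residual(fit1)   # all 100 rows, minus 2 coefficients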

It would be best to use the individual data points instead of aggregated
data, as that allows you to estimate the within-group variation as well.

If you had individual data points, you could try something like the following.
Please check the code, as I am no expert in the area of repeated measures.

x  <- runif(100, 0, 100)
y1 <- x + rnorm(100, mean=1, sd=15)
y2 <- y1 + rnorm(100, sd=5)

df <- data.frame( y=c(y1, y2),
                  x=c(x,x),
                  subject=factor(rep( paste("p", 1:100, sep=""), 2 ) ))

library(nlme)
summary( lme( y ~ x, random = ~ 1 | subject, data=df ) )
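
# A possible follow-up, sketched on the same kind of simulated data (the
# seed and the names x.s, y1.s, y2.s, d3, fit are illustrative): fixef()
# returns the fixed-effect estimates and VarCorr() the between-subject and
# residual variance components, which is the within-group information
# that aggregated data cannot provide.
library(nlme)
set.seed(3)
x.s  <- runif(100, 0, 100)
y1.s <- x.s + rnorm(100, mean = 1, sd = 15)
y2.s <- y1.s + rnorm(100, sd = 5)
d3 <- data.frame(y = c(y1.s, y2.s),
                 x = c(x.s, x.s),
                 subject = factor(rep(paste("p", 1:100, sep = ""), 2)))
fit <- lme(y ~ x, random = ~ 1 | subject, data = d3)
fixef(fit)      # fixed effects: intercept and slope
VarCorr(fit)    # random-intercept and residual variances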

Try reading Pinheiro and Bates (http://tinyurl.com/yvvrr7) or related