[R] Weighted least squares

Tue May 8 14:16:50 CEST 2007

See below.

hadley wickham wrote:
> Dear all,
> 
> I'm struggling with weighted least squares, where something that I had
> assumed to be true appears not to be the case.  Take the following
> data set as an example:
> 
> df <- data.frame(x = runif(100, 0, 100))
> df$y <- df$x + 1 + rnorm(100, sd=15)
> 
> I had expected that:
> 
> summary(lm(y ~ x, data=df, weights=rep(2, 100)))
> summary(lm(y ~ x, data=rbind(df,df)))

You assign weights to different points according to some external 
quality or reliability measure not number of times the data point was 
measured.

Look at the estimates and standard error of the two models below:

  coefficients( summary(f.w <- lm(y ~ x, data=df, weights=rep(2, 100))) )
              Estimate Std. Error   t value     Pr(>|t|)
  (Intercept) 1.940765 3.30348066  0.587491 5.582252e-01
  x           0.982610 0.05893262 16.673448 2.264258e-30

  coefficients( summary( f.u <- lm(y ~ x, data=rbind(df,df) ) ) )
              Estimate Std. Error    t value     Pr(>|t|)
  (Intercept) 1.940765 2.32408609  0.8350659 4.046871e-01
  x           0.982610 0.04146066 23.6998165 1.012067e-59

You can see that they have same coefficient estimates but the second one 
  has smaller variances.

The repeated values artificially deflates the variance and thus inflates 
the precision. This is why you cannot treat replicate data as 
independent observations.

> would be equivalent, but they are not.  I suspect the difference is
> how the degrees of freedom is calculated - I had expected it to be
> sum(weights), but seems to be sum(weights > 0).  This seems
> unintuitive to me:
> 
> summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
> summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))
> 
> What am I missing?  And what is the usual way to do a linear
> regression when you have aggregated data?

I would be best to use the individual data points instead of aggregated 
data as it allows you to estimate the within-group variations as well.

If you had individual data points, you could try something as follows. 
Please check the codes as I am no expert in the area of repeated measures.

  x  <- runif(100, 0, 100)
  y1 <- x + rnorm(100, mean=1, sd=15)
  y2 <- y1 + rnorm(100, sd=5)

  df <- data.frame( y=c(y1, y2),
                    x=c(x,x),
                    subject=factor(rep( paste("p", 1:100, sep=""), 2 ) ))

  library(nlme)
  summary( lme( y ~ x, random = ~ 1 | subject, data=df ) )

Try reading Pinheiro and Bates (http://tinyurl.com/yvvrr7) or related 
material for more information. Hope this helps.

> Thanks,
> 
> Hadley

Regards, Adai