[R] Weighted least squares

hadley wickham h.wickham at gmail.com
Tue May 8 14:36:07 CEST 2007


On 5/8/07, Adaikalavan Ramasamy <ramasamy at cancer.org.uk> wrote:
> See below.
>
> hadley wickham wrote:
> > Dear all,
> >
> > I'm struggling with weighted least squares, where something that I had
> > assumed to be true appears not to be the case.  Take the following
> > data set as an example:
> >
> > df <- data.frame(x = runif(100, 0, 100))
> > df$y <- df$x + 1 + rnorm(100, sd=15)
> >
> > I had expected that:
> >
> > summary(lm(y ~ x, data=df, weights=rep(2, 100)))
> > summary(lm(y ~ x, data=rbind(df,df)))
>
> You assign weights to different points according to some external
> quality or reliability measure, not the number of times the data point
> was measured.

That is one type of weighting - but what if my data have already been
aggregated?  Weighting by counts is a perfectly valid type of weighting
too.
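
To make the distinction concrete: lm() treats weights as relative
precisions, so a constant weight of 2 is indistinguishable from a
constant weight of 1 - it is not the same as doubling the data.  A
quick check, rerunning the simulation from above:

  df <- data.frame(x = runif(100, 0, 100))
  df$y <- df$x + 1 + rnorm(100, sd = 15)
  all.equal(coef(summary(lm(y ~ x, data = df, weights = rep(2, 100)))),
            coef(summary(lm(y ~ x, data = df))))   # TRUE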

> Look at the estimates and standard error of the two models below:
>
>   coefficients( summary(f.w <- lm(y ~ x, data=df, weights=rep(2, 100))) )
>               Estimate Std. Error   t value     Pr(>|t|)
>   (Intercept) 1.940765 3.30348066  0.587491 5.582252e-01
>   x           0.982610 0.05893262 16.673448 2.264258e-30
>
>   coefficients( summary( f.u <- lm(y ~ x, data=rbind(df,df) ) ) )
>               Estimate Std. Error    t value     Pr(>|t|)
>   (Intercept) 1.940765 2.32408609  0.8350659 4.046871e-01
>   x           0.982610 0.04146066 23.6998165 1.012067e-59
>
> You can see that they have the same coefficient estimates, but the
> second one has smaller variances.
>
> The repeated values artificially deflate the variance and thus inflate
> the precision. This is why you cannot treat replicate data as
> independent observations.

Hardly artificially - I have repeated observations.
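
In fact, the two standard errors in the output above differ by exactly
the square root of the ratio of the residual degrees of freedom (98
vs. 198), which is precisely the df issue raised below:

  0.05893262 * sqrt(98 / 198)   # = 0.04146066, the rbind() standard error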

> > would be equivalent, but they are not.  I suspect the difference is
> > in how the degrees of freedom are calculated - I had expected it to
> > be sum(weights), but it seems to be sum(weights > 0).  This seems
> > unintuitive to me:
> >
> > summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
> > summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))
> >
> > What am I missing?  And what is the usual way to do a linear
> > regression when you have aggregated data?
>
> It would be best to use the individual data points instead of the
> aggregated data, as that allows you to estimate the within-group
> variation as well.

There is no within-group variation - these are observations that occur
with the same values many times in the dataset, so they have been
aggregated into a contingency-table-like format.
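
For what it's worth, one way to recover the disaggregated inferences
from a fit on the aggregated table is to rescale the variance by the
correct residual degrees of freedom.  A sketch, assuming a data frame
agg with columns x, y and a count column n (names invented for
illustration):

  fit <- lm(y ~ x, data = agg, weights = n)
  p        <- length(coef(fit))
  df.cases <- sum(agg$n) - p    # df based on the true number of cases
  df.used  <- fit$df.residual   # df lm() actually used: sum(n > 0) - p
  se   <- sqrt(diag(vcov(fit)) * df.used / df.cases)
  tval <- coef(fit) / se
  pval <- 2 * pt(abs(tval), df.cases, lower.tail = FALSE)
  cbind(Estimate = coef(fit), Std.Error = se, t = tval, p = pval)

Applied to the example above (n = 2 for every row), this rescales the
weighted-fit standard error 0.05893262 by sqrt(98/198), giving the
0.04146066 reported for the rbind() fit.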

> If you had individual data points, you could try something like the
> following.  Please check the code, as I am no expert in the area of
> repeated measures.
>
>   x  <- runif(100, 0, 100)
>   y1 <- x + rnorm(100, mean=1, sd=15)
>   y2 <- y1 + rnorm(100, sd=5)
>
>   df <- data.frame( y=c(y1, y2),
>                     x=c(x,x),
>                     subject=factor(rep( paste("p", 1:100, sep=""), 2 ) ))
>
>   library(nlme)
>   summary( lme( y ~ x, random = ~ 1 | subject, data=df ) )
>
> Try reading Pinheiro and Bates (http://tinyurl.com/yvvrr7) or related
> material for more information. Hope this helps.

I'm not interested in a mixed model, and I don't have individual data points.
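
Returning to the degrees-of-freedom point from my original message, the
behaviour is easy to verify directly: a zero weight drops the case
entirely, while any positive weight, however small, still counts as one
observation:

  df <- data.frame(x = runif(100, 0, 100))
  df$y <- df$x + 1 + rnorm(100, sd = 15)
  lm(y ~ x, data = df, weights = rep(c(0,    2), each = 50))$df.residual  # 48
  lm(y ~ x, data = df, weights = rep(c(0.01, 2), each = 50))$df.residual  # 98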

Hadley


