[R] Weighted least squares

Tue May 8 16:12:44 CEST 2007

Sorry, you did not explain that your weights correspond to your 
frequency in the original post. I assumed they were repeated 
measurements with within group variation.

I was merely responding to your query why the following differed.
    summary(lm(y ~ x, data=df, weights=rep(2, 100)))
    summary(lm(y ~ x, data=rbind(df,df)))

Let me also clarify my statement about "artificial". If one treats 
repeated observations as independent, then they obtain estimates with 
inflated precision. I was not calling your data artificial in any way.

Using frequency as weights may be valid. Your data points appear to 
arise from discrete distribution, so I am not entirely sure if you can 
use the linear model which assumes the errors are normally distributed.

Regards, Adai

hadley wickham wrote:
> On 5/8/07, Adaikalavan Ramasamy <ramasamy at cancer.org.uk> wrote:
>> See below.
>>
>> hadley wickham wrote:
>> > Dear all,
>> >
>> > I'm struggling with weighted least squares, where something that I had
>> > assumed to be true appears not to be the case.  Take the following
>> > data set as an example:
>> >
>> > df <- data.frame(x = runif(100, 0, 100))
>> > df$y <- df$x + 1 + rnorm(100, sd=15)
>> >
>> > I had expected that:
>> >
>> > summary(lm(y ~ x, data=df, weights=rep(2, 100)))
>> > summary(lm(y ~ x, data=rbind(df,df)))
>>
>> You assign weights to different points according to some external
>> quality or reliability measure not number of times the data point was
>> measured.
> 
> That is one type of weighting - but what if I have already aggregated
> data?  That is a perfectly valid type of weighting too.
> 
>> Look at the estimates and standard error of the two models below:
>>
>>   coefficients( summary(f.w <- lm(y ~ x, data=df, weights=rep(2, 100))) )
>>               Estimate Std. Error   t value     Pr(>|t|)
>>   (Intercept) 1.940765 3.30348066  0.587491 5.582252e-01
>>   x           0.982610 0.05893262 16.673448 2.264258e-30
>>
>>   coefficients( summary( f.u <- lm(y ~ x, data=rbind(df,df) ) ) )
>>               Estimate Std. Error    t value     Pr(>|t|)
>>   (Intercept) 1.940765 2.32408609  0.8350659 4.046871e-01
>>   x           0.982610 0.04146066 23.6998165 1.012067e-59
>>
>> You can see that they have same coefficient estimates but the second one
>>   has smaller variances.
>>
>> The repeated values artificially deflates the variance and thus inflates
>> the precision. This is why you cannot treat replicate data as
>> independent observations.
> 
> Hardly artificially - I have repeated observations.
> 
>> > would be equivalent, but they are not.  I suspect the difference is
>> > how the degrees of freedom is calculated - I had expected it to be
>> > sum(weights), but seems to be sum(weights > 0).  This seems
>> > unintuitive to me:
>> >
>> > summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
>> > summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))
>> >
>> > What am I missing?  And what is the usual way to do a linear
>> > regression when you have aggregated data?
>>
>> I would be best to use the individual data points instead of aggregated
>> data as it allows you to estimate the within-group variations as well.
> 
> There is no within group variation - these are observations that occur
> with same values many times in the dataset, so have been aggregated
> into the a contingency table-like format.
> 
>> If you had individual data points, you could try something as follows.
>> Please check the codes as I am no expert in the area of repeated 
>> measures.
>>
>>   x  <- runif(100, 0, 100)
>>   y1 <- x + rnorm(100, mean=1, sd=15)
>>   y2 <- y1 + rnorm(100, sd=5)
>>
>>   df <- data.frame( y=c(y1, y2),
>>                     x=c(x,x),
>>                     subject=factor(rep( paste("p", 1:100, sep=""), 2 ) ))
>>
>>   library(nlme)
>>   summary( lme( y ~ x, random = ~ 1 | subject, data=df ) )
>>
>> Try reading Pinheiro and Bates (http://tinyurl.com/yvvrr7) or related
>> material for more information. Hope this helps.
> 
> I'm not interested in a mixed model, and I don't have individual data 
> points.
> 
> Hadley
> 
> 
>