[R] Weighted least squares
hadley wickham
h.wickham at gmail.com
Tue May 8 14:36:07 CEST 2007
On 5/8/07, Adaikalavan Ramasamy <ramasamy at cancer.org.uk> wrote:
> See below.
>
> hadley wickham wrote:
> > Dear all,
> >
> > I'm struggling with weighted least squares, where something that I had
> > assumed to be true appears not to be the case. Take the following
> > data set as an example:
> >
> > df <- data.frame(x = runif(100, 0, 100))
> > df$y <- df$x + 1 + rnorm(100, sd=15)
> >
> > I had expected that:
> >
> > summary(lm(y ~ x, data=df, weights=rep(2, 100)))
> > summary(lm(y ~ x, data=rbind(df,df)))
>
> You assign weights to different points according to some external
> quality or reliability measure not number of times the data point was
> measured.
That is one type of weighting - but what if I have already aggregated
data? That is a perfectly valid type of weighting too.
> Look at the estimates and standard error of the two models below:
>
> coefficients( summary(f.w <- lm(y ~ x, data=df, weights=rep(2, 100))) )
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 1.940765 3.30348066 0.587491 5.582252e-01
> x 0.982610 0.05893262 16.673448 2.264258e-30
>
> coefficients( summary( f.u <- lm(y ~ x, data=rbind(df,df) ) ) )
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 1.940765 2.32408609 0.8350659 4.046871e-01
> x 0.982610 0.04146066 23.6998165 1.012067e-59
>
> You can see that they have same coefficient estimates but the second one
> has smaller variances.
>
> The repeated values artificially deflates the variance and thus inflates
> the precision. This is why you cannot treat replicate data as
> independent observations.
Hardly artificially - I have repeated observations.
> > would be equivalent, but they are not. I suspect the difference is
> > how the degrees of freedom is calculated - I had expected it to be
> > sum(weights), but seems to be sum(weights > 0). This seems
> > unintuitive to me:
> >
> > summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
> > summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))
> >
> > What am I missing? And what is the usual way to do a linear
> > regression when you have aggregated data?
>
> I would be best to use the individual data points instead of aggregated
> data as it allows you to estimate the within-group variations as well.
There is no within group variation - these are observations that occur
with same values many times in the dataset, so have been aggregated
into the a contingency table-like format.
> If you had individual data points, you could try something as follows.
> Please check the codes as I am no expert in the area of repeated measures.
>
> x <- runif(100, 0, 100)
> y1 <- x + rnorm(100, mean=1, sd=15)
> y2 <- y1 + rnorm(100, sd=5)
>
> df <- data.frame( y=c(y1, y2),
> x=c(x,x),
> subject=factor(rep( paste("p", 1:100, sep=""), 2 ) ))
>
> library(nlme)
> summary( lme( y ~ x, random = ~ 1 | subject, data=df ) )
>
> Try reading Pinheiro and Bates (http://tinyurl.com/yvvrr7) or related
> material for more information. Hope this helps.
I'm not interested in a mixed model, and I don't have individual data points.
Hadley
More information about the R-help
mailing list