[R] A doubt about "lm" and the meaning of its summary

Charles C. Berry cberry at tajo.ucsd.edu
Tue Aug 19 19:36:42 CEST 2008



Alberto,

Moshe was actually on the right track.

Your model does not satisfy the assumptions of the usual Gauss-Markov 
setup, so unbiasedness of the LS solution is not guaranteed.

In particular, unbiasedness requires that the errors have zero expectation 
conditional on the design columns (so, in particular, that the errors be 
orthogonal to the design columns).

But in your case they are not: each error feeds into the subsequent values 
of lp, and hence into the design column x.
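
To get a feel for the size of the effect in your reversion model, here is 
a quick simulation sketch (the values m = -0.5, c = 0, sigma = 1, n = 10 
below are illustrative choices of mine, not taken from your post):

   ## LS slope in the reversion model lp[t+1] = (1 + m) * lp[t] + eps[t]
   ## (c = 0, sigma = 1 here); the average fitted slope falls well below
   ## the true m
   set.seed(1)
   m <- -0.5; n <- 10; NN <- 5000
   m.bar <- replicate(NN, {
       lp <- numeric(n + 1)
       for (j in 1:n) lp[j + 1] <- (1 + m) * lp[j] + rnorm(1)
       x <- lp[1:n]
       y <- lp[-1] - x
       coef(lm(y ~ x))["x"]
   })
   mean(m.bar)  # noticeably below the true value of -0.5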

The source of the bias is easiest to see in the simplest case: n == 2, 
m == 0, c == 0.

Take eps <- rnorm(2)

Then x <- c(0, eps[1]) and y <- c(eps[1], eps[2]).

The expectation of y is evidently 0, so the observed y is the error 
vector.

With basic linear algebra, you can verify that, conditional on x (that is, 
on eps[1]), the cross-product rbind(1, x) %*% y has a non-zero expectation 
(its first component is eps[1]), and that the regression coefficient for 
'x', which works out to eps[2]/eps[1] - 1, has conditional expectation -1 
rather than the true value of 0.
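
If you would rather see it numerically than algebraically, a minimal 
sketch (the names are mine, for illustration only):

   ## with n == 2, m == 0, c == 0 the fitted slope works out to
   ## eps[2]/eps[1] - 1, a shifted Cauchy, so its median is -1
   ## even though the true slope is 0
   set.seed(2)
   slopes <- replicate(10000, {
       eps <- rnorm(2)
       x <- c(0, eps[1])
       y <- c(eps[1], eps[2])
       coef(lm(y ~ x))["x"]
   })
   median(slopes)  # close to -1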

And just for fun:

The sample correlation of x and y when n == 2 is either +1 or -1. You can 
easily deduce from a simple graphical argument (or an equivalent argument 
invoking symmetry in the bivariate distribution of eps) that -1 
correlations are exactly 3 times as likely as +1 correlations.
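
A quick simulation check of that claim (again only a sketch):

   ## the sign of the n == 2 sample correlation is the sign of
   ## (x[2] - x[1]) * (y[2] - y[1]) = eps[1] * (eps[2] - eps[1])
   set.seed(3)
   s <- replicate(10000, {
       eps <- rnorm(2)
       sign(eps[1] * (eps[2] - eps[1]))
   })
   mean(s == -1)  # about 3/4
   mean(s == +1)  # about 1/4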

HTH,

Chuck


On Mon, 18 Aug 2008, Moshe Olshansky wrote:

> Hi Alberto,
>
> Please disregard my previous note - I probably had a black-out!!!
>
>
> --- On Tue, 19/8/08, Alberto Monteiro <albmont at centroin.com.br> wrote:
>
>> From: Alberto Monteiro <albmont at centroin.com.br>
>> Subject: [R] A doubt about "lm" and the meaning of its summary
>> To: r-help at r-project.org
>> Received: Tuesday, 19 August, 2008, 4:31 AM
>> I have a conceptual problem (sort of). Maybe there's a simple
>> solution, maybe not.
>>
>> First, let me explain the test case that goes ok.
>>
>> Let x be a (fixed) n-dimensional vector. I simulate a lot of linear
>> models, y = m x + c + error, then I do a lot of regressions. As
>> expected, the estimated values of m (let's call them m.bar) are
>> distributed according to a Student's t distribution.
>>
>> More precisely (test case, it takes a few minutes to run):
>> #
>> # start with fixed numbers
>> #
>>   m <- sample(c(-1, -0.1, 0.1, 1), 1)
>>   c <- sample(c(-1, -0.1, 0, 0.1, 1), 1)
>>   sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
>>   n <- sample(c(4,5,10,1000), 1)
>>   x <- 1:n # it's possible to use other x
>>   NN <- sample(c(3000, 4000, 5000), 1)
>>   m.bar <- m.std.error <- numeric(NN)  # vectors holding the simulation output
>>
>> #
>> # Now let's repeat NN times:
>> # simulate y
>> # regress y ~ x
>> # store m.bar and its error
>> #
>>   for (i in 1:NN) {
>>     y <- m * x + c + sigma * rnorm(n)
>>     reg <- lm(y ~ x)
>>     m.bar[i] <- summary(reg)$coefficients['x', 'Estimate']
>>     m.std.error[i] <- summary(reg)$coefficients['x', 'Std. Error']
>>   }
>> #
>> # Finally, let's analyse it
>> # The distribution of (m.bar - m) / m.std.error is
>> # a Student's t with n - 2 degrees of freedom.
>> # Is it? Let's test!
>> #
>>   print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))
>>
>> Now, this goes ok, with ks.test returning a p-value uniformly
>> distributed in the 0-1 interval.
>>
>> Next, I tried a slightly different model. This is a reversion model,
>> where the simulated random variable lp evolves according to the
>> equation:
>> lp[t + 1] <- (1 + m) * lp[t] + c + error
>>
>> I tried to use the same method as above, defining
>> x = lp[1:n], y = lp[2:(n+1)] - lp[1:n], with equation
>> y <- m * x + c + error
>>
>> And now it breaks. Test case (it takes some minutes to run):
>>
>> #
>> # start with fixed numbers
>> #
>>   m <- runif(1, -1, 0)  # m must be something in the (-1,0) interval
>>   c <- sample(c(-1, -0.1, 0, 0.1, 1), 1)
>>   sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
>>   n <- sample(c(4,5,10,1000), 1)
>>   NN <- sample(c(3000, 4000, 5000), 1)
>>   m.bar <- m.std.error <- numeric(NN)  # vectors holding the simulation output
>>
>> #
>> # Now let's repeat NN times:
>> # simulate y
>> # regress y ~ x
>> # store m.bar and its error
>> #
>>   for (i in 1:NN) {
>>     lp <- numeric(n + 1)  # lp[1] starts at 0
>>     for (j in 1:n) {
>>       lp[j+1] <- (1 + m) * lp[j] + c + sigma * rnorm(1)
>>     }
>>     x <- lp[1:n]
>>     y <- lp[-1] - x
>>     reg <- lm(y ~ x)
>>     m.bar[i] <- summary(reg)$coefficients['x', 'Estimate']
>>     m.std.error[i] <- summary(reg)$coefficients['x', 'Std. Error']
>>   }
>> #
>> # Finally, let's analyse it
>> # The distribution of (m.bar - m) / m.std.error should be
>> # a Student's t with n - 2 degrees of freedom.
>> # Is it? Let's test!
>> #
>>   print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))
>>
>> ... and now it's not.
>>
>> What's wrong with this? I suspect that the model y ~ x only gives
>> meaningful values when x is deterministic; in this case x is
>> stochastic. Is there any correct way to estimate this model that
>> gives meaningful output?
>>
>> Alberto Monteiro
>>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
