[R] Estimate of variance and prediction for multiple linear regression

Gavin Simpson gavin.simpson at ucl.ac.uk
Thu Jun 24 09:16:16 CEST 2010


On Wed, 2010-06-23 at 16:30 -0700, cc super wrote:
> What if the size of the newdata is different from the previous one
> used to generate the regression model?
>  
> Let's say
>  
> pdat <- data.frame(x1 = rnorm(5, 2), x2 = rnorm(5))
> predict(lin, pdat)
>  
> It comes up with warning and the result is not correct.

No it doesn't (not on my system at least) and you give no indication
what the warning/error is or why you think the result is incorrect.

set.seed(1234)
y <- rnorm(10, mean = 5)
dat <- data.frame(x1 = rnorm(10, mean = 2), 
                  x2 = rnorm(10))
lin <- lm(y~x1+x2, data = dat)
pdat <- data.frame(x1 = rnorm(5, 2), x2 = rnorm(5))
predict(lin, pdat)

> predict(lin, pdat)
       1        2        3        4        5 
4.971021 6.482085 5.742130 4.652898 5.194980 

The size of newdata shouldn't matter at all:

> pdat2 <- data.frame(x1 = rnorm(1, 2), x2 = rnorm(1))
> predict(lin, pdat2)
       1 
4.185984
> pdat3 <- data.frame(x1 = rnorm(1000, 2), x2 = rnorm(1000))
> head((p3 <- predict(lin, pdat3)))
       1        2        3        4        5        6 
4.076389 4.118338 4.054443 3.976399 4.055590 4.126018 
> length(p3)
[1] 1000

So can you show the exact code you used plus the errors/warnings you
report and why you think the results are incorrect?

For example, automating your by-hand computation a bit shows that
predict() produces the correct result. Compare:

> ## The cbind(1, ....) bit is for the intercept...
> data.matrix(cbind(1, pdat)) %*% coef(lin)
         [,1]
[1,] 4.178516
[2,] 4.101112
[3,] 4.081370
[4,] 4.146819
[5,] 4.186405
> predict(lin, pdat)
       1        2        3        4        5 
4.178516 4.101112 4.081370 4.146819 4.186405

HTH

G

>  
> Thanks!
> 
>  
> 2010/6/23 Gavin Simpson <gavin.simpson at ucl.ac.uk>
>         On Tue, 2010-06-22 at 23:11 -0700, cc super wrote:
>         > Hi, everyone,
>         >
>         > Night. I have three questions about multiple linear
>         regression in R.
>         >
>         > Q1:
>         >
>         > y=rnorm(10,mean=5)
>         > x1=rnorm(10,mean=2)
>         > x2=rnorm(10)
>         > lin=lm(y~x1+x2)
>         > summary(lin)
>         >
>         > ## In the summary, 'Residual standard error: 1.017 on 7
>         degrees of freedom',
>         > 1.017 is the estimate of the constance variance?
>         
>         
>         Yes, it is sigma.
>         
>         Just a note, in order for the above code to yield the same
>         results as
>         you quote, you need a call to set.seed() to fix the pseudo
>         random number
>         generator.
>         
>         > Q2:
>         >
>         > beta0=lin$coefficients[1]
>         > beta1=lin$coefficients[2]
>         > beta2=lin$coefficients[3]
>         >
>         > y_hat=beta0+beta1*x1+beta2*x2
>         >
>         > ## Is there any built-in function in R to obtain y_hat
>         directly?
>         
>         
>         fitted(lin)
>         
>         Note that there are quite a few standard extractor functions
>         like fitted
>         available for modelling functions in R. coef() for example
>         should be
>         used to extract the coefficients, resid() will extract
>         residuals etc.
>         
>         > Q3:
>         >
>         > If I want to apply this regression result to another
>         dataset, that is, new
>         > x1 and x2. Is the built-in function in 2 still available?
>         
>         
>         It is called predict() (although if you called predict(lin)
>         above
>         instead of fitted(lin) it would have produced the same answer;
>         the
>         fitted values for the observations).
>         
>         One gotcha that catches people out is that in the new dataset,
>         the
>         variables (used in the model) must have the same names as the
>         data frame
>         used to fit it. So we could do:
>         
>         pdat <- data.frame(x1 = rnorm(10, 2), x2 = rnorm(10))
>         predict(lin, pdat)
>         
>         to get predictions at the new values of x1 an x2.
>         
>         > Thank you in advance!
>         
>         HTH
>         
>         G
>         
>         --
>         %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~
>         %~%~%~%
>          Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
>          ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
>          Pearson Building,             [e]
>         gavin.simpsonATNOSPAMucl.ac.uk
>          Gower Street, London          [w]
>         http://www.ucl.ac.uk/~ucfagls/
>          UK. WC1E 6BT.                 [w]
>         http://www.freshwaters.org.uk
>         %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~
>         %~%~%~%
>         
> 

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%



More information about the R-help mailing list