[R] Help needed in interpreting linear models

Petr PIKAL petr.pikal at precheza.cz
Fri Jan 13 10:35:53 CET 2012


Hi

This looks quite like homework, which by the policy of this list does not 
get answered.
I am far from being an expert in statistics, so this is only my opinion. 
Your height variable behaves like a two-level factor, and the single 190 
value goes with a rather suspicious value of weight, as you can see in the 
plot

plot(scores, weight)
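
A minimal sketch of what this means, using only the data from the original 
post. The names is_tall, fit_all and fit_180 are made up for illustration. 
Because 190 occurs exactly once, height acts as an indicator for the third 
player alone. A model containing height can then fit that player exactly, 
and the weight slope is estimated from the other five players only.

```r
# Data copied from the original post
scores <- c(2, 6, 10, 12, 14, 20)
weight <- c(60, 70, 80, 75, 80, 85)
height <- c(180, 180, 190, 180, 180, 180)

# height has only two distinct values, and 190 occurs exactly once
table(height)
is_tall <- height == 190   # hypothetical helper name

# Same plot as above, with the 190 cm player drawn as a filled point
plot(scores, weight, pch = ifelse(is_tall, 19, 1))

# Model with both predictors
fit_all <- lm(scores ~ weight + height)
# Model with weight only, leaving out the 190 cm player
fit_180 <- lm(scores ~ weight, subset = !is_tall)

resid(fit_all)[is_tall]       # residual of the 190 cm player
hatvalues(fit_all)[is_tall]   # leverage of the 190 cm player
coef(fit_all)["weight"]       # weight slope with height in the model
coef(fit_180)["weight"]       # weight slope without the 190 cm player
```

The residual is zero up to rounding and the leverage is 1, so in the 
two-predictor model the height coefficient only measures how far the third 
player lies from the line through the remaining five.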

Regards
Petr


> Dear members of the R-help list,
> 
> I have sent the email below to the R-SIG-ME list to ask for help in
> interpreting some R output of fitted linear models.
> 
> Unfortunately, I haven't yet received any answers. As I am not sure if my
> email was sent successfully to the mailing list, I am asking for help here:
> 
> 
> 
> Dear members of the R-SIG-ME list,
> 
> 
> I am new to linear models and struggling with interpreting some of the R
> output but hope to get some advice from here.
> 
> I created the following dummy data set:
> 
> scores <- c(2,6,10,12,14,20)
> 
> weight <- c(60,70,80,75,80,85)
> 
> height <- c(180,180,190,180,180,180)
> 
> The scores of a game/match should be dependent on the weight of the player
> but not on the height.
> 
> For me the output of the following two linear models makes sense:
> 
> > (lm1 <- summary(lm(scores ~ weight)))
> 
> Call:
> lm(formula = scores ~ weight)
> 
> Residuals:
>        1        2        3        4        5        6 
>  1.08333 -1.41667 -3.91667  1.33333  0.08333  2.83333 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|) 
> (Intercept) -38.0833    10.0394  -3.793  0.01921 * 
> weight        0.6500     0.1331   4.885  0.00813 **
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
> 
> Residual standard error: 2.661 on 4 degrees of freedom
> Multiple R-squared: 0.8564,   Adjusted R-squared: 0.8205 
> F-statistic: 23.86 on 1 and 4 DF,  p-value: 0.008134 
> 
> > 
> > (lm2 <- summary(lm(scores ~ height)))
> 
> Call:
> lm(formula = scores ~ height)
> 
> Residuals:
>          1          2          3          4          5          6 
> -8.800e+00 -4.800e+00  1.377e-14  1.200e+00  3.200e+00  9.200e+00 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)  25.2000   139.6175   0.180    0.866
> height       -0.0800     0.7684  -0.104    0.922
> 
> Residual standard error: 7.014 on 4 degrees of freedom
> Multiple R-squared: 0.002703,   Adjusted R-squared: -0.2466 
> F-statistic: 0.01084 on 1 and 4 DF,  p-value: 0.9221 
> 
> The p-value of the first output is 0.008134, which makes sense as scores and
> weight have a high correlation and therefore the scores "can be explained"
> by the explanatory variable/factor weight very well. Hence, the R-squared
> value is close to 1. For the second example it also makes sense that the
> p-value is almost 1 (p=0.9221) as there is hardly any correlation between
> scores and height.
> 
> What is not clear to me is shown in my 3rd linear model, which includes both
> weight and height.
> 
> > (lm3 <- summary(lm(scores ~ weight + height)))
> 
> Call:
> lm(formula = scores ~ weight + height)
> 
> Residuals:
>          1          2          3          4          5          6 
>  1.189e+00 -1.946e+00 -2.165e-15  4.865e-01 -1.081e+00  1.351e+00 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|) 
> (Intercept) 49.45946   33.50261   1.476  0.23635 
> weight       0.71351    0.08716   8.186  0.00381 **
> height      -0.50811    0.19096  -2.661  0.07628 . 
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
> 
> Residual standard error: 1.677 on 3 degrees of freedom
> Multiple R-squared: 0.9573,   Adjusted R-squared: 0.9288 
> F-statistic:  33.6 on 2 and 3 DF,  p-value: 0.008833 
> 
> It makes sense that the R-squared value is higher when one adds both
> explanatory variables/factors to the linear model, as the more variables are
> added the more variance is explained and therefore the fit of the model will
> be better. However, I do NOT understand why the p-value of height
> (Pr(>|t|) = 0.07628) is now almost significant? And also, I do NOT
> understand why the overall p-value of 0.008833 is less significant as
> compared to the one from model lm1, which was p-value: 0.008134.
> 
> The p-value of weight being low (p=0.00381) makes sense as this factor
> "explains" the scores very well.
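
On the overall p-values, a rough sketch under the assumption that only the 
posted data are used. The helper overall_p and the names fit_w and fit_wh 
are made up for illustration. The overall F statistic of the larger model is 
bigger (33.6 against 23.86), but it is referred to an F distribution with 2 
and 3 degrees of freedom instead of 1 and 4. With six observations, giving 
up a residual degree of freedom raises the bar noticeably.

```r
# Data copied from the original post
scores <- c(2, 6, 10, 12, 14, 20)
weight <- c(60, 70, 80, 75, 80, 85)
height <- c(180, 180, 190, 180, 180, 180)

fit_w  <- lm(scores ~ weight)
fit_wh <- lm(scores ~ weight + height)

# Hypothetical helper: recompute the overall p-value from the F statistic
# stored in the model summary (named vector: value, numdf, dendf)
overall_p <- function(fit) {
  f <- summary(fit)$fstatistic
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

p_w  <- overall_p(fit_w)    # overall test of scores ~ weight
p_wh <- overall_p(fit_wh)   # overall test of scores ~ weight + height

# 5% critical values the two F statistics have to exceed
qf(0.95, df1 = 1, df2 = 4)
qf(0.95, df1 = 2, df2 = 3)
```

The two overall tests also have different null hypotheses (one slope is 
zero versus both slopes are zero), so their p-values are not a measure of 
which model is better.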
> 
> 
> 
> After fitting the 3 models (lm1, lm2 and lm3) I wanted to compare model lm1
> with lm3 using the anova function to check whether the factor height
> significantly improves the model. In other words I wanted to check if adding
> height to the model helps explaining the scores of the players.
> 
> The output of the anova looks as follows:
> 
> > lm1 <- lm(scores ~ weight)
> > 
> > lm2 <- lm(scores ~ weight + height)
> > 
> > anova(lm1,lm2)
> Analysis of Variance Table
> 
> Model 1: scores ~ weight
> Model 2: scores ~ weight + height
>   Res.Df     RSS Df Sum of Sq      F  Pr(>F) 
> 1      4 28.3333 
> 2      3  8.4324  1    19.901 7.0801 0.07628 .
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
> 
> In my opinion the p-value should be almost 1 and not close to significance
> (0.07), as we have seen from model lm2 that height does not at all "explain"
> the scores. Here, I thought that a significant p-value means that the factor
> height adds significant value to the model.
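
A small sketch of why the anova p-value equals the one printed for height in 
the two-predictor summary; the names fit_w, fit_wh and cmp are made up for 
illustration. When two nested models differ by a single coefficient, the F 
statistic of the comparison is the square of that coefficient's t value, so 
both outputs report the same test. That test asks whether height explains 
what is left over after weight is accounted for, which is a different 
question from the one answered by scores ~ height alone.

```r
# Data copied from the original post
scores <- c(2, 6, 10, 12, 14, 20)
weight <- c(60, 70, 80, 75, 80, 85)
height <- c(180, 180, 190, 180, 180, 180)

fit_w  <- lm(scores ~ weight)
fit_wh <- lm(scores ~ weight + height)

# t test of the height coefficient in the larger model
t_height <- coef(summary(fit_wh))["height", "t value"]
p_height <- coef(summary(fit_wh))["height", "Pr(>|t|)"]

# F test comparing the nested models
cmp     <- anova(fit_w, fit_wh)
f_anova <- cmp$F[2]
p_anova <- cmp[2, "Pr(>F)"]

# Squared t value next to the F statistic, and the two p-values
c(t_squared = t_height^2, F = f_anova)
c(p_t = p_height, p_F = p_anova)
```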
> 
> 
> I would be very grateful if anyone could help me in interpreting the R
> output.
> 
> Best regards
> 
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Help-needed-in-interpreting-linear-models-tp4291670p4291670.html
> Sent from the R help mailing list archive at Nabble.com.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


