[R] Strange degrees of freedom and SS from car::Anova with type II SS?

Thu Dec 6 03:43:06 CET 2018

Dear R.,

The problem you constructed is too ill-conditioned for the method that Anova() uses to compute type-II sums of squares and the associated degrees of freedom, with an immense condition number of the coefficient covariance matrix:

> library(car)
Loading required package: carData

> mod <- lm(prestige ~ women * type * income * education, data=Prestige)
> e <- eigen(vcov(mod))$values
> max(e)/min(e)
[1] 2.776205e+17

Simply centering the numerical predictors reduces the condition number by a factor of 10^3, which allows Anova() to work, even though the problem is still extremely ill-conditioned:

> Prestige.c <- within(Prestige, {
+   income <- income - mean(income)
+   education <- education - mean(education)
+   women <- women - mean(women)
+ })
> mod.c <- lm(prestige ~ women * type * income * education, data=Prestige.c)
> e.c <- eigen(vcov(mod.c))$values
> max(e)/min(e)
[1] 2.776205e+17

> Anova(mod.c)
Anova Table (Type II tests)

Response: prestige
                             Sum Sq Df F value    Pr(>F)    
women                        167.29  1  4.9516 0.0291142 *  
type                         744.30  2 11.0150 6.494e-05 ***
income                       789.00  1 23.3529 7.112e-06 ***
education                    699.54  1 20.7050 2.057e-05 ***
women:type                   140.32  2  2.0766 0.1326023    
women:income                  33.14  1  0.9807 0.3252424    
type:income                  653.40  2  9.6697 0.0001859 ***
women:education               30.36  1  0.8986 0.3462316    
type:education                 0.72  2  0.0107 0.9893462    
income:education               7.88  1  0.2331 0.6306681    
women:type:income            136.80  2  2.0245 0.1393087    
women:type:education         140.18  2  2.0745 0.1328633    
women:income:education       100.42  1  2.9722 0.0888832 .  
type:income:education         82.02  2  1.2138 0.3029069    
women:type:income:education    2.05  2  0.0303 0.9701334    
Residuals                   2500.16 74                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> mod.c.2 <- update(mod.c, . ~ . - women:type:income:education)
> sum(residuals(mod.c.2)^2) - sum(residuals(mod.c)^2)
[1] 2.049735

Beyond demonstrating that the algorithm that Anova() uses can be made to fail if the coefficient covariance matrix is sufficiently ill-conditioned problem, I’m not sure what the point of this is. I suppose that we could try to detect this condition, which falls in the small region between where lm() detects a singularity and the projections used by Anova() break down.

Best,
 John

-------------------------------------------------
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: http::/socserv.mcmaster.ca/jfox

> On Dec 5, 2018, at 7:33 PM, Ramon Diaz-Uriarte <rdiaz02 using gmail.com> wrote:
> 
> 
> Dear All,
> 
> I do not understand the degrees of freedom returned by car::Anova under
> some models. They seem to be too many (e.g., numerical variables getting
> more than 1 df, factors getting more df than levels there are).
> 
> This is a reproducible example:
> 
> library(car)
> data(Prestige)
> 
> ## Make sure no issues from NAs in comparisons of SS below
> prestige_nona <- na.omit(Prestige)
> 
> Anova(lm(prestige ~ women * type * income * education,
>         data = prestige_nona))
> 
> ## Notice how women, a numerical variable, has 3 df
> ## and type (factor with 3 levels) has 4 df.
> 
> 
> ## In contrast this seems to get the df right:
> Anova(lm(prestige ~ women * type * income * education,
>         data = prestige_nona), type = "III")
> 
> ## And also gives the df I'd expect
> anova(lm(prestige ~ women * type * income * education,
>         data = prestige_nona))
> 
> 
> 
> ## Type II SS for women in the above model I do not understand either.
> m_1 <- lm(prestige ~ type * income * education, data = prestige_nona)
> m_2 <- lm(prestige ~ type * income * education + women, data = prestige_nona)
> ## Does not match women SS
> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)
> 
> ## See [1] below for examples where they match.
> 
> 
> Looking at the code, I do not understand what the call from
> linearHypothesis returns here (specially compared to other models), and the
> problem seems to be in the return from ConjComp, possibly due to the the
> vcov of the model? (But this is over my head).
> 
> 
> I understand this is not a reasonable model to fit, and there are possibly
> serious collinearity problems. But I was surprised by the dfs in the
> absence of any warning of something gone wrong. So I think there is
> something very basic I do not understand.
> 
> 
> 
> Thanks,
> 
> 
> R.
> 
> 
> [1] In contrast, in other models I see what I'd expect. For example:
> 
> ## 1 df for women, 2 for type
> Anova(lm(prestige ~ type * income * women, data = prestige_nona))
> m_1 <- lm(prestige ~ type * income, data = prestige_nona)
> m_2 <- lm(prestige ~ type * income + women, data = prestige_nona)
> ## Type II SS for women
> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)
> 
> ## 1 df for women, income, education
> Anova(lm(prestige ~ education * income * women, data = prestige_nona))
> m_1 <- lm(prestige ~ education * income, data = prestige_nona)
> m_2 <- lm(prestige ~ education * income + women, data = prestige_nona)
> ## Type II SS for women
> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)
> 
> 
> 
> 
> --
> Ramon Diaz-Uriarte
> Department of Biochemistry, Lab B-25
> Facultad de Medicina
> Universidad Autónoma de Madrid
> Arzobispo Morcillo, 4
> 28029 Madrid
> Spain
> 
> Phone: +34-91-497-2412
> 
> Email: rdiaz02 using gmail.com
>       ramon.diaz using iib.uam.es
> 
> http://ligarto.org/rdiaz
> 
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.