[Rd] problems with glm (PR#771)

Fri, 29 Dec 2000 09:02:42 +0100 (MET)

[We do have a TooMuchAtOnce category, but it is easier to cope with
short separate reports with informative subject lines for, e.g. the BUGS
list.]

On Mon, 18 Dec 2000 james.lindsey@luc.ac.be wrote:

> R1.2.0 with Linux RH5.2
> 
> I do not believe that the problems below are new to 1.2 but I only
> cover this sort of thing once a year in my course and some of that
> happened to be last Friday so too late to report for 1.2. I see that
> one or two things that I was going to report have been corrected.
> I like the fact that interactions now show : instead of .
> 
> Here is some output with comments inserted.
> 
[...]

> > wt <- codes(Reg66)!=codes(Reg71)

Why not just wt <- Reg66 != Reg71  ?

[...]

> First problem: with diagonal values weighted out in a transition matrix
> (mover-stayer model), glm wrongly estimates the diagonals to be the
> observed plus 0.1. This was correct in 90.1 a year ago but already
> wrong in 1.0.1. I don't have any versions in between installed.

I think this is the same as

[Rd] glm gives incorrect results for zero-weight cases (PR#780)

in which case it is already fixed in R-patched.  That does look to have
solved this one too.  Note that glm has *not* `estimated the diagonals'
at all.  These are predictions (the cases are not in the model), and
predict.glm did get them right.

predict(z, data)
      1       2       3       4       5       6       7       8 
0.61337 3.30334 3.45098 4.51559 3.33786 6.02783 6.17548 7.24009 
      9      10      11      12      13      14      15      16 
3.50109 6.19106 6.33871 7.40332 4.57309 7.26306 7.41071 8.47532 

[...]

> Second problem: the man page for factors states that the levels are
> sorted, but gl does not do this. It keeps them in the order of the

There is a help page for factor(), not for factors, and I don't believe
that does state so. It says that *factor* sorts them by default, which is
true.  Where is the `man page for factors'?

> labels. This is inconsistent with everything else, confusing for
> students, and liable to induce errors in data analysis. In the example
> below, it is not too important because the baseline category is the
> same in the two cases but that will not generally be the case.
> 
> As an addition comment, I would strongly recommend that, if labels can
> be coerced to numeric, they be ordered, not sorted. This is because
> the order is very confusing if there are more than 9: for example, the
> sorted 10 comes near the beginning and not at the end.

You can do that, of course, by specifying the order of the levels.  
However, we teach students that the details of the coding of linear models
are details, and that they should regard coefficients as secondary and
predictions as primary.  aov() has the right idea in suppressing the
individual coefficients.  If you want to interpret coefficients you need to
understand codings. Period.

I don't see the value of changing the default for factor: it would not be
backwards-compatible, and some people have thought thought what the default
would be and accepted it. 

> > reg66 <- gl(4,1,16,labels=c("CC","LY","WM","GL"))
> > z
> 
> Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson) 
> 
> Coefficients:
> (Intercept)      Reg66GL      Reg66LY      Reg66WM      Reg71GL      Reg71LY  
>      0.6134       3.9022       2.6900       2.8376       3.9597       2.7245  
>     Reg71WM  
>      2.8877  
> 
> Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
> Null Deviance:	    40740 
> Residual Deviance: 19880 	AIC: 20000 
> > print(za <- glm(Freq~reg66+Reg71,family=poisson))
> 
> Call:  glm(formula = Freq ~ reg66 + Reg71, family = poisson) 
> 
> Coefficients:
> (Intercept)      reg66LY      reg66WM      reg66GL      Reg71GL      Reg71LY  
>      0.6134       2.6900       2.8376       3.9022       3.9597       2.7245  
>     Reg71WM  
>      2.8877  
> 
> Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
> Null Deviance:	    40740 
> Residual Deviance: 19880 	AIC: 20000 

> Third problem: if mean value contrasts are used, the level names are
> not attached to the variable name, but simply the number of the
> category. This rightly drives students mad and is simply
> uninterpretable if the levels have been sorted. It takes a great deal
> of extreme care to try to find out to what the numbers refer.

Are we supposed to guess that `mean value contrasts' means contr.sum?  In
which case the numbers refer to the number of the *contrast*: they do not
refer to levels, nor are single levels relevant. As in

> contr.sum(levels(Reg66))
   [,1] [,2] [,3]
CC    1    0    0
GL    0    1    0
LY    0    0    1
WM   -1   -1   -1
> contr.treatment(levels(Reg66))
   GL LY WM
CC  0  0  0
GL  1  0  0
LY  0  1  0
WM  0  0  1

Now, what should the column labels be in the first case?  And in what sense
are these `mean value contrasts'?

Do you really want labels like "CCvWM"?  They could get very cumbersome.

(One could argue that contr.treatment is already wrong, but as they
are not even contrasts ....)

[...]

> Additional comment: I recommend that the aic functions for poisson and
> binomial have the explicit calculation of the log density replaced by
> the corresponding d function with the log option. This will avoid the
> production of NAs in certain cases of zeroes in tables.

Good idea, but could you supply examples so we can put them in the
regression tests?

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._