Handling of offsets in glm is really inconsistent.

Prof Brian D Ripley ripley@stats.ox.ac.uk
Sat, 22 Aug 1998 18:06:00 +0100 (BST)


[Copied to R-devel for information]

This applies to all versions of R I have: 0.62.2, 0.62.3, 0.63.
Great care seems to be needed with glms that have offsets, as several
things appear to be wrong.

Consider the following:

> data(freeny)
> freeny.glm <- glm(y ~ offset(lag.quarterly.revenue) + price.index +
    income.level + market.potential, data=freeny, subset=1:30)
> predict(freeny.glm)
            Qtr1       Qtr2       Qtr3       Qtr4
1962:         NA 0.01040457 0.01073223 0.01233351
1963: 0.01211730 0.02744293 0.03259214 0.03225403
1964: 0.03235081 0.03272723 0.03361984 0.03114351
1965: 0.03128098 0.03371820 0.02869783 0.02670215
1966: 0.02346093 0.02172193 0.02269731 0.01927851
1967: 0.02516107 0.02496217 0.02679227 0.03099460
1968: 0.03225807 0.03418791 0.03857815 0.03324247
1969: 0.02575733 0.02702418 0.02988582         NA
> predict(freeny.glm, type="response")
          Qtr1     Qtr2     Qtr3     Qtr4
1962:       NA 8.806765 8.803092 8.803704
1963: 8.826977 8.840453 8.940102 8.968984
1964: 8.993961 8.993167 9.042300 9.061634
1965: 9.100341 9.092428 9.135678 9.153552
1966: 9.194421 9.208372 9.260927 9.284149
1967: 9.309521 9.338742 9.377042 9.389345
1968: 9.429928 9.455688 9.480808 9.520452
1969: 9.549497 9.566824 9.611116       NA

so one might think that prediction for a glm with an offset should
include the offset on the response scale but not on the link scale.
However,
> predict(freeny.glm, newdata=freeny)
   1962.25     1962.5    1962.75       1963    1963.25 
0.01040457 0.01073223 0.01233351 0.01211730 0.02744293 
    1963.5    1963.75       1964    1964.25     1964.5 
0.03259214 0.03225403 0.03235081 0.03272723 0.03361984 
   1964.75       1965    1965.25     1965.5    1965.75 
0.03114351 0.03128098 0.03371820 0.02869783 0.02670215 
      1966    1966.25     1966.5    1966.75       1967 
0.02346093 0.02172193 0.02269731 0.01927851 0.02516107 
   1967.25     1967.5    1967.75       1968    1968.25 
0.02496217 0.02679227 0.03099460 0.03225807 0.03418791 
    1968.5    1968.75       1969    1969.25     1969.5 
0.03857815 0.03324247 0.02575733 0.02702418 0.02988582 
   1969.75       1970    1970.25     1970.5    1970.75 
0.02686281 0.03816228 0.03910487 0.04325638 0.03599068 
      1971    1971.25     1971.5    1971.75 
0.03357946 0.03890549 0.04196893 0.03952385 

> predict(freeny.glm, newdata=freeny, type="response")
   1962.25     1962.5    1962.75       1963    1963.25 
0.01040457 0.01073223 0.01233351 0.01211730 0.02744293 
    1963.5    1963.75       1964    1964.25     1964.5 
0.03259214 0.03225403 0.03235081 0.03272723 0.03361984 
   1964.75       1965    1965.25     1965.5    1965.75 
0.03114351 0.03128098 0.03371820 0.02869783 0.02670215 
      1966    1966.25     1966.5    1966.75       1967 
0.02346093 0.02172193 0.02269731 0.01927851 0.02516107 
   1967.25     1967.5    1967.75       1968    1968.25 
0.02496217 0.02679227 0.03099460 0.03225807 0.03418791 
    1968.5    1968.75       1969    1969.25     1969.5 
0.03857815 0.03324247 0.02575733 0.02702418 0.02988582 
   1969.75       1970    1970.25     1970.5    1970.75 
0.02686281 0.03816228 0.03910487 0.04325638 0.03599068 
      1971    1971.25     1971.5    1971.75 
0.03357946 0.03890549 0.04196893 0.03952385 

and prediction on either scale with newdata ignores the offset. Now, S is
also inconsistent (prior to S-PLUS 4.5 release 2), but at least it does
usually include the offset (except for prediction on the link scale with no
newdata, where the offset was omitted until recently). [The different print
layout is because the original response is a time series of class "ts"; the
predict method cannot know that.]
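
(As a stop-gap one can put the offset back by hand; on the link scale
for this fit something like

   > ## stop-gap: add the offset back to the link-scale prediction by hand
   > predict(freeny.glm, newdata=freeny) + freeny$lag.quarterly.revenue

should do, but that is no substitute for fixing predict itself.)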

The discrepancy is in predict.lm (not predict.glm), which ignores offsets
in R but includes them in S. Correcting it is made harder by the way
delete.response also deletes offsets in R:

> terms(freeny.glm)
y ~ offset(lag.quarterly.revenue) + price.index + income.level + 
            market.potential
attr(,"variables")
list(y, offset(lag.quarterly.revenue), price.index, income.level, 
            market.potential)
attr(,"factors")
                              price.index income.level market.potential
y                                       0            0                0
offset(lag.quarterly.revenue)           0            0                0
price.index                             1            0                0
income.level                            0            1                0
market.potential                        0            0                1
attr(,"term.labels")
[1] "price.index"      "income.level"     "market.potential"
attr(,"order")
[1] 1 1 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,"offset")
[1] 2

> delete.response(terms(freeny.glm))
~price.index + income.level + market.potential
attr(,"variables")
list(price.index, income.level, market.potential)
attr(,"factors")
                 price.index income.level market.potential
price.index                1            0                0
income.level               0            1                0
market.potential           0            0                1
attr(,"term.labels")
[1] "price.index"      "income.level"     "market.potential"
attr(,"order")
[1] 1 1 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 0

and that is definitely not what is documented to happen.  I do not begin to
understand why delete.response is written the way it is: it looks like it
needs a complete re-design. 

So

- delete.response needs to delete only the response.
- predict.lm needs to handle offsets (or predict.glm, but it is
  much cleaner in predict.lm); see the sketch after this list.
- glm.fit should return linear.predictors = eta + offset.
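
For the second point, something along these lines should work in the
newdata branch of predict.lm (a sketch only: tt and X are hypothetical
local names for the terms of the fit and the model matrix built from
newdata, and it assumes offset() can be evaluated as a function that
returns its argument):

   ## sketch, not actual source: evaluate any offset terms in newdata
   ## and add them to the linear predictor
   tt <- terms(object)
   pred <- drop(X %*% coef(object))
   off.num <- attr(tt, "offset")
   if (!is.null(off.num))
       for (i in off.num)
           pred <- pred + eval(attr(tt, "variables")[[i + 1]], newdata)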

There is another small problem:

> predict(freeny.glm, se=T)
Error: Object "price.index" not found

as predict.lm does not recognize that newdata was missing in the caller
(how lazy is lazy evaluation?).  The simplest way out is to have

   if (missing(newdata) || is.null(newdata)) X <- model.matrix(object)

in predict.lm; the alternative is to test missing(newdata) in predict.glm.
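
That alternative would look something like this inside predict.glm (again
only a sketch; it passes newdata on only when the caller supplied it):

   ## sketch: forward newdata to predict.lm only if it was actually given
   pred <- if (missing(newdata)) predict.lm(object, se.fit = se.fit, ...)
           else predict.lm(object, newdata, se.fit = se.fit, ...)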

While this is being done, I wonder why R does not implement offsets
for lm(); it `looks like an easy exercise'.
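
(Until it does, an lm offset can be emulated by shifting the response,
since lm is just the identity-link case; for this example something like

   > lm(y - lag.quarterly.revenue ~ price.index + income.level +
        market.potential, data=freeny, subset=1:30)

should give the same coefficients as the offset form would.)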

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
