[Rd] vcov and survival

Martin Maechler maechler at stat.math.ethz.ch
Thu Nov 2 21:59:00 CET 2017


>>>>> Fox, John <jfox at mcmaster.ca>
>>>>>     on Thu, 14 Sep 2017 13:46:44 +0000 writes:

    > Dear Martin, I made three points which likely got lost
    > because of the way I presented them:

    > (1) Singularity is an unusual situation and should be made
    > more prominent. It typically reflects a problem with the
    > data or the specification of the model. That's not to say
    > that it *never* makes sense to allow singular fits (as in
    > the situations you mentions).

    > I'd favour setting singular.ok=FALSE as the default, but
    > in the absence of that a warning or at least a note. A
    > compromise would be to have a singular.ok option() that
    > would be FALSE out of the box.

    > Any changes would have to be made very carefully so as not
    > to create chaos.

I for one, am too reluctant to want to change the default there.

    >  That goes for the points below as well.

    > (2) coef() and vcov() behave inconsistently, which can be
    > problematic because one often uses them together in code.

indeed; and I had agreed on that.
As of today, in R-devel only they now behave compatibly.
NEWS entry

    • The “default” ("lm" etc) methods of vcov() have gained new
      optional argument complete = TRUE which makes the vcov() methods
      more consistent with the coef() methods in the case of singular
      designs.  The former behavior is now achieved by vcov(*,
      complete=FALSE).


    > (3) As you noticed in your second message, lm() has a
    > singular.ok argument and glm() doesn't.

and that has been amended even earlier (a bit more than a month
ago) in R-devel svn rev 73380 with  NEWS  entry

    • glm() and glm.fit get the same singular.ok=TRUE argument that
      lm() has had forever.  As a consequence, in glm(*, method =
      <your_own>), user specified methods need to accept a singular.ok
      argument as well.

    > I'll take a look at the code for glm() with an eye towards
    > creating a patch, but I'm a bit reluctant to mess with the
    > code for something as important as glm().

and as a matter of fact you did send me +- the R code part of
that change.

My current plan is to also add the  'complete = TRUE' option to the
"basic" coef() methods, such that you also have consistent
coef(*, complete=FALSE)  and
vcov(*, complete=FALSE)  behaviors.

Thank you and Terry (and others?) for bringing up the issues and
discussing them thoroughly!

Best,
Martin.


    > Best, John



    >> -----Original Message----- From: Martin Maechler
    >> [mailto:maechler at stat.math.ethz.ch] Sent: Thursday,
    >> September 14, 2017 4:23 AM To: Martin Maechler
    >> <maechler at stat.math.ethz.ch> Cc: Fox, John
    >> <jfox at mcmaster.ca>; Therneau, Terry M., Ph.D.
    >> <therneau at mayo.edu>; r-devel at r-project.org Subject: Re:
    >> [Rd] vcov and survival
    >> 
    >> >>>>> Martin Maechler <maechler at stat.math.ethz.ch> >>>>>
    >> on Thu, 14 Sep 2017 10:13:02 +0200 writes:
    >> 
    >> >>>>> Fox, John <jfox at mcmaster.ca> >>>>> on Wed, 13 Sep
    >> 2017 22:45:07 +0000 writes:
    >> 
    >> >> Dear Terry, >> Even the behaviour of lm() and glm()
    >> isn't entirely consistent. In both cases, singularity
    >> results in NA coefficients by default, and these are
    >> reported in the model summary and coefficient vector, but
    >> not in the coefficient covariance matrix:
    >> 
    >> >> ----------------
    >> 
    >> >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP +
    >> Population), >> + data=longley) >>> summary(mod.lm)
    >> 
    >> >> Call: >> lm(formula = Employed ~ GNP + Population +
    >> I(GNP + Population), >> data = longley)
    >> 
    >> >> Residuals: >> Min 1Q Median 3Q Max >> -0.80899
    >> -0.33282 -0.02329 0.25895 1.08800
    >> 
    >> >> Coefficients: (1 not defined because of singularities)
    >> >> Estimate Std. Error t value Pr(>|t|) >> (Intercept)
    >> 88.93880 13.78503 6.452 2.16e-05 *** >> GNP 0.06317
    >> 0.01065 5.933 4.96e-05 *** >> Population -0.40974 0.15214
    >> -2.693 0.0184 * >> I(GNP + Population) NA NA NA NA
    >> >> ---
    >> >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
    >> 0.1 ' ' 1
    >> 
    >> >> Residual standard error: 0.5459 on 13 degrees of
    >> freedom >> Multiple R-squared: 0.9791, Adjusted
    >> R-squared: 0.9758 >> F-statistic: 303.9 on 2 and 13 DF,
    >> p-value: 1.221e-11
    >> 
    >> >>> vcov(mod.lm) >> (Intercept) GNP Population >>
    >> (Intercept) 190.0269691 0.1445617813 -2.0954381 >> GNP
    >> 0.1445618 0.0001133631 -0.0016054 >> Population
    >> -2.0954381 -0.0016053999 0.0231456 >>> coef(mod.lm) >>
    >> (Intercept) GNP Population I(GNP + Population) >>
    >> 88.93879831 0.06317244 -0.40974292 NA
    >> >>>
    >> >>> mod.glm <- glm(Employed ~ GNP + Population + I(GNP +
    >> Population), >> + data=longley) >>> summary(mod.glm)
    >> 
    >> >> Call: >> glm(formula = Employed ~ GNP + Population +
    >> I(GNP + Population), >> data = longley)
    >> 
    >> >> Deviance Residuals: >> Min 1Q Median 3Q Max >>
    >> -0.80899 -0.33282 -0.02329 0.25895 1.08800
    >> 
    >> >> Coefficients: (1 not defined because of singularities)
    >> >> Estimate Std. Error t value Pr(>|t|) >> (Intercept)
    >> 88.93880 13.78503 6.452 2.16e-05 *** >> GNP 0.06317
    >> 0.01065 5.933 4.96e-05 *** >> Population -0.40974 0.15214
    >> -2.693 0.0184 * >> I(GNP + Population) NA NA NA NA
    >> >> ---
    >> >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
    >> 0.1 ' ' 1
    >> 
    >> >> (Dispersion parameter for gaussian family taken to be
    >> 0.2980278)
    >> 
    >> >> Null deviance: 185.0088 on 15 degrees of freedom >>
    >> Residual deviance: 3.8744 on 13 degrees of freedom >>
    >> AIC: 30.715
    >> 
    >> >> Number of Fisher Scoring iterations: 2
    >> 
    >> >>> coef(mod.glm) >> (Intercept) GNP Population I(GNP +
    >> Population) >> 88.93879831 0.06317244 -0.40974292 NA >>>
    >> vcov(mod.glm) >> (Intercept) GNP Population >>
    >> (Intercept) 190.0269691 0.1445617813 -2.0954381 >> GNP
    >> 0.1445618 0.0001133631 -0.0016054 >> Population
    >> -2.0954381 -0.0016053999 0.0231456
    >> 
    >> >> ----------------
    >> 
    >> >> Moreoever, lm() has a singular.ok() argument that
    >> defaults to TRUE, but glm() doesn't have this argument:
    >> 
    >> >> ----------------
    >> 
    >> >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP +
    >> Population), >> + data=longley, singular.ok=FALSE) >>
    >> Error in lm.fit(x, y, offset = offset, singular.ok =
    >> singular.ok, ...) : >> singular fit encountered
    >> 
    >> >> ----------------
    >> 
    >> >> In my opinion, singularity should at least produce a
    >> warning, both in calls to lm() and glm(), and in
    >> summary() output. Even better, again in my opinion, would
    >> be to produce an error by default in this situation, but
    >> doing so would likely break too much existing code.
    >> 
    >> > Yes, I would not want to change.  Note that this is
    >> from S > already, i.e., long "ingrained".  I think there
    >> one argument was > that there are situations with factor
    >> predictors of many levels > and conceptually their 2- or
    >> even 3-way interactions (!)  > where it is neat to just
    >> fit the model, (-> get residuals and > fitted values) and
    >> also see implicitly the "necessary rank" of > prediction
    >> space, or rather even more specifically, you see for >
    >> every factor how many levels are "distinguishable"/useful
    >> for > prediction, given the data.
    >> 
    >> >> I prefer NA to 0 for the redundant coefficients
    >> because it at least suggests that the decision about what
    >> to exclude is arbitrary, and of course simply excluding
    >> coefficients isn't the only way to proceed.
    >> 
    >> > I'm less modest and would say *definitely*, NA's are
    >> highly > prefered in such a situation.
    >> 
    >> >> Finally, the differences in behaviour between coef()
    >> and vcov() and between lm() and glm() aren't really
    >> sensible.
    >> 
    >> > I really haven't seen any difference between lm() and
    >> glm() in > the example above.  Maybe you can point them
    >> out for me.
    >> 
    >> .. now I saw it: lm() has a 'singular.ok = TRUE' argument
    >> which you can set to FALSE if you prefer an error to NA
    >> coefficients.
    >> 
    >> I also agree with you John that it would be nice if glm()
    >> also got such an argument.  Patches are welcome and seem
    >> easy. Nowadays we prefer them as attachments (diff/patch
    >> file!) at R's https://bugs.r-project.org bugzilla against
    >> the svn source, here
    >> https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R
    >> and
    >> https://svn.r-project.org/R/trunk/src/library/stats/man/glm.Rd
    >> 
    >> > I do quite agree that vcov() should be compatible with
    >> > coef() [and summary()] for both 'lm' and 'glm' methods,
    >> i.e., > should get NA rows and columns there.  This would
    >> require > eliminating these before e.g. using it in
    >> solve(<vcov>, *) etc, > but I think it would be a good
    >> idea that the useR must deal with > these NAs actively.
    >> 
    >> > Shall "we" try and see the fallout in CRAN space?
    >> 
    >> >> Maybe there's some reason for all this that escapes
    >> me.  > (for the first one---"no error"--- I gave a
    >> reason)
    >> 
    >> >> Best, >> John
    >> 
    >> >> --------------------------------------
    >> >> John Fox, Professor Emeritus >> McMaster University >>
    >> Hamilton, Ontario, Canada >> Web:
    >> socserv.mcmaster.ca/jfox
    >> 
    >> 
    >> 
    >> 
    >> >>> -----Original Message----- >>> From: R-devel
    >> [mailto:r-devel-bounces at r-project.org] On Behalf Of >>>
    >> Therneau, Terry M., Ph.D.  >>> Sent: Wednesday, September
    >> 13, 2017 6:19 PM >>> To: r-devel at r-project.org >>>
    >> Subject: [Rd] vcov and survival
    >> >>>
    >> >>> I have just noticed a difference in behavior between
    >> coxph and lm/glm: >>> if one or more of the coefficients
    >> from the fit in NA, then lm and glm >>> omit that
    >> row/column from the variance matrix; while coxph retains
    >> it >>> but sets the values to zero.
    >> >>>
    >> >>> Is this something that should be "fixed", i.e., made
    >> to agree? I >>> suspect that doing so will break other
    >> packages, but then NA coefs are >>> rather rare so
    >> perhaps not.
    >> >>>
    >> >>> Terry Therneau



More information about the R-devel mailing list