[R] Predicted values from glm() when linear predictor is NA.

Bill Dunlap w||||@mwdun|@p @end|ng |rom gm@||@com
Thu Jul 28 20:00:16 CEST 2022


In this example, x has an unknown influence on y (since x doesn't vary)
   > coef(lm(y ~ x, data=data.frame(y=101:105,x=rep(7,5))))
   (Intercept)           x
           103          NA
while in this example x has no influence on y, a very different conclusion
   > coef(lm(y ~ x, data=data.frame(y=rep(103,5),x=1:5)))
    (Intercept)           x
            103           0

Are you arguing that lm's (or glm's) output should have another component
telling us whether a column was included in the model or not (i.e., if a
coefficient was well defined or not)?  Then coef() would have to copy this
information so it could show the user that (e.g., by printing '-' instead
of NA).  That might be reasonable.  Using NA for an unknown coefficient
value also seems reasonable and a lot of code currently depends upon it.
Does it need to be documented better?

-Bill


On Thu, Jul 28, 2022 at 9:50 AM Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
wrote:

> Yes, I am familiar with linear algebra. I nod along with you until you get
> to "hence" and then you make a leap. The user made the decision (or
> accepted the default) that the rank problem be ignored... they have the
> option to reduce their model, but they wanted this coefficient
> (communicated via the model and the parameter) to behave as if it were
> zero. The issue is that that decision is getting hidden behind this
> additional implementation decision to exclude certain columns in the model
> matrix _even after the model has been solved_. If they try to use the
> coefficients themselves then they have to apply this odd interpretation of
> NA that is embedded in predict.glm but not apparent in their model instead
> of using a simple newdata %*% coefs.
>
> It may be easier to implement summary.glm with an NA there as a hint to
> the rank deficiency, but at the cost of mathematical inconsistency in the
> reporting of coefficients. IMO either the NA should always be presented to
> the user as if it were a zero or the rank deficiency should be recorded
> separately.
>
> On July 28, 2022 8:46:13 AM PDT, John Fox <jfox using mcmaster.ca> wrote:
> >Dear Jeff,
> >
> >On 2022-07-28 11:12 a.m., Jeff Newmiller wrote:
> >> No, in this case I think I needed the "obvious" breakdown. Still
> digesting, though... I would prefer that if an arbitrary selection had been
> made that it be explicit .. the NA should be replaced with zero if the
> singular.ok argument is TRUE, rather than making that interpretation in
> predict.glm.
> >
> >That's one way to think about, but another is that the model matrix X has
> 10 columns but is of rank 9. Thus 9 basis vectors are needed to span the
> column space of X, and a simple way to provide a basis is to eliminate a
> redundant column, hence the NA. The fitted values y-hat in a linear model
> are the orthogonal projection of y onto the space spanned by the columns of
> X, and are thus independent of the basis chosen. A GLM is a little more
> complicated, but it's still the column space of X that's important.
> >
> >Best,
> > John
> >
> >>
> >> On July 28, 2022 5:45:35 AM PDT, John Fox <jfox using mcmaster.ca> wrote:
> >>> Dear Jeff,
> >>>
> >>> On 2022-07-28 1:31 a.m., Jeff Newmiller wrote:
> >>>> But "disappearing" is not what NA is supposed to do normally. Why is
> it being treated that way here?
> >>>
> >>> NA has a different meaning here than in data.
> >>>
> >>> By default, in glm() the argument singular.ok is TRUE, and so
> estimates are provided even when there are singularities, and even though
> the singularities are resolved arbitrarily.
> >>>
> >>> In this model, the columns of the model matrix labelled LifestageL1
> and TrtTime:LifestageL1 are perfectly collinear -- the second is 12 times
> the first (both have 0s in the same rows and either 1 or 12 in three of the
> rows) -- and thus both can't be estimated simultaneously, but the model can
> be estimated by eliminating one or the other (effectively setting its
> coefficient to 0), or by taking any linear combination of the two
> regressors (i.e., using any regressor with 0s and some other value). The
> fitted values under the model are invariant with respect to this arbitrary
> choice.
> >>>
> >>> My apologies if I'm stating the obvious and misunderstand your
> objection.
> >>>
> >>> Best,
> >>> John
> >>>
> >>>>
> >>>> On July 27, 2022 7:04:20 PM PDT, John Fox <jfox using mcmaster.ca> wrote:
> >>>>> Dear Rolf,
> >>>>>
> >>>>> The coefficient of TrtTime:LifestageL1 isn't estimable (as you
> explain) and by setting it to NA, glm() effectively removes it from the
> model. An equivalent model is therefore
> >>>>>
> >>>>>> fit2 <- glm(cbind(Dead,Alive) ~ TrtTime + Lifestage +
> >>>>> +               I((Lifestage == "Egg + L1")*TrtTime) +
> >>>>> +               I((Lifestage == "L1 + L2")*TrtTime) +
> >>>>> +               I((Lifestage == "L3")*TrtTime),
> >>>>> +             family=binomial, data=demoDat)
> >>>>> Warning message:
> >>>>> glm.fit: fitted probabilities numerically 0 or 1 occurred
> >>>>>
> >>>>>> cbind(coef(fit, complete=FALSE), coef(fit2))
> >>>>>                                    [,1]         [,2]
> >>>>> (Intercept)                -0.91718302  -0.91718302
> >>>>> TrtTime                     0.88846195   0.88846195
> >>>>> LifestageEgg + L1         -45.36420974 -45.36420974
> >>>>> LifestageL1                14.27570572  14.27570572
> >>>>> LifestageL1 + L2           -0.30332697  -0.30332697
> >>>>> LifestageL3                -3.58672631  -3.58672631
> >>>>> TrtTime:LifestageEgg + L1   8.10482459   8.10482459
> >>>>> TrtTime:LifestageL1 + L2    0.05662651   0.05662651
> >>>>> TrtTime:LifestageL3         1.66743472   1.66743472
> >>>>>
> >>>>> There is no problem computing fitted values for the model, specified
> either way. That the fitted values when Lifestage == "L1" all round to 1 on
> the probability scale is coincidental -- that is, a consequence of the data.
> >>>>>
> >>>>> I hope this helps,
> >>>>> John
> >>>>>
> >>>>> On 2022-07-27 8:26 p.m., Rolf Turner wrote:
> >>>>>>
> >>>>>> I have a data frame with a numeric ("TrtTime") and a categorical
> >>>>>> ("Lifestage") predictor.
> >>>>>>
> >>>>>> Level "L1" of Lifestage occurs only with a single value of TrtTime,
> >>>>>> explicitly 12, whence it is not possible to estimate a TrtTime
> "slope"
> >>>>>> when Lifestage is "L1".
> >>>>>>
> >>>>>> Indeed, when I fitted the model
> >>>>>>
> >>>>>>        fit <- glm(cbind(Dead,Alive) ~ TrtTime*Lifestage,
> family=binomial,
> >>>>>>                   data=demoDat)
> >>>>>>
> >>>>>> I got:
> >>>>>>
> >>>>>>> as.matrix(coef(fit))
> >>>>>>>                                      [,1]
> >>>>>>> (Intercept)                -0.91718302
> >>>>>>> TrtTime                     0.88846195
> >>>>>>> LifestageEgg + L1         -45.36420974
> >>>>>>> LifestageL1                14.27570572
> >>>>>>> LifestageL1 + L2           -0.30332697
> >>>>>>> LifestageL3                -3.58672631
> >>>>>>> TrtTime:LifestageEgg + L1   8.10482459
> >>>>>>> TrtTime:LifestageL1                 NA
> >>>>>>> TrtTime:LifestageL1 + L2    0.05662651
> >>>>>>> TrtTime:LifestageL3         1.66743472
> >>>>>>
> >>>>>> That is, TrtTime:LifestageL1 is NA, as expected.
> >>>>>>
> >>>>>> I would have thought that fitted or predicted values corresponding
> to
> >>>>>> Lifestage = "L1" would thereby be NA, but this is not the case:
> >>>>>>
> >>>>>>> predict(fit)[demoDat$Lifestage=="L1"]
> >>>>>>>          26       65      131
> >>>>>>> 24.02007 24.02007 24.02007
> >>>>>>>
> >>>>>>> fitted(fit)[demoDat$Lifestage=="L1"]
> >>>>>>>     26  65 131
> >>>>>>>      1   1   1
> >>>>>>
> >>>>>> That is, the predicted values on the scale of the linear predictor
> are
> >>>>>> large and positive, rather than being NA.
> >>>>>>
> >>>>>> What this amounts to, it seems to me, is saying that if the linear
> >>>>>> predictor in a Binomial glm is NA, then "success" is a certainty.
> >>>>>> This strikes me as being a dubious proposition.  My gut feeling is
> that
> >>>>>> misleading results could be produced.
> >>>>>>
> >>>>>> Can anyone explain to me a rationale for this behaviour pattern?
> >>>>>> Is there some justification for it that I am not currently seeing?
> >>>>>> Any other comments?  (Please omit comments to the effect of "You
> are as
> >>>>>> thick as two short planks!". :-) )
> >>>>>>
> >>>>>> I have attached the example data set in a file "demoDat.txt", should
> >>>>>> anyone want to experiment with it.  The file was created using
> dput() so
> >>>>>> you should access it (if you wish to do so) via something like
> >>>>>>
> >>>>>>        demoDat <- dget("demoDat.txt")
> >>>>>>
> >>>>>> Thanks for any enlightenment.
> >>>>>>
> >>>>>> cheers,
> >>>>>>
> >>>>>> Rolf Turner
> >>>>>>
> >>>>>>
> >>>>>> ______________________________________________
> >>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> >>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>
> >>
>
> --
> Sent from my phone. Please excuse my brevity.
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

	[[alternative HTML version deleted]]



More information about the R-help mailing list