[R] fitting in logistic model

(Ted Harding) Ted.Harding at manchester.ac.uk
Tue Jun 30 19:54:21 CEST 2009


On 30-Jun-09 17:41:20, Marc Schwartz wrote:
> On Jun 30, 2009, at 10:44 AM, Ted Harding wrote:
> 
>>
>> On 30-Jun-09 14:52:20, Marc Schwartz wrote:
>>> On Jun 30, 2009, at 4:54 AM, Renzi Fabrizio wrote:
>>>
>>>> I would like to know how R computes the probability of an event
>>>> in a logistic model (P(y=1)) from the score s, the linear
>>>> combination of x and beta.
>>>>
>>>> I noticed that there are differences (small, less than e-16)
>>>> between the fitted values automatically computed by R's glm
>>>> procedure and the values I computed "manually" by applying the
>>>> inverse formula p = e^s/(1+e^s); moreover, I noticed that the
>>>> minimum of the fitted values in R's estimation is 2.220446e-16,
>>>> with many observations at exactly this probability (whereas the
>>>> minimum value obtained by the "manual" estimation is
>>>> 2.872636e-152).
>>>
>>> It would be helpful to see at least a subset of the output from your
>>> model and your manual computations so that we can at least visually
>>> compare the results to see where the differences may be.
>>>
>>> The model object returned from using glm() will contain both the
>>> linear predictors on the link scale (model$linear.predictors) and
>>> the fitted values (model$fitted.values). The latter can be accessed
>>> using the fitted() extractor function.
>>>
>>> To use an example, let's create a simple LR model using the infert
>>> data set as referenced in ?glm.
>>>
>>> model1 <- glm(case ~ spontaneous + induced, data = infert,
>>>              family = binomial())
>>>
>>>> model1
>>> Call:  glm(formula = case ~ spontaneous + induced,
>>>             family = binomial(), data = infert)
>>>
>>> Coefficients:
>>> (Intercept)  spontaneous      induced
>>>     -1.7079       1.1972       0.4181
>>>
>>> Degrees of Freedom: 247 Total (i.e. Null);  245 Residual
>>> Null Deviance:            316.2
>>> Residual Deviance: 279.6      AIC: 285.6
>>>
>>> # Get the coefficients
>>>> coef(model1)
>>> (Intercept) spontaneous     induced
>>>  -1.7078601   1.1972050   0.4181294
>>>
>>> # get the linear predictor values
>>> # log odds scale for binomial glm
>>>> head(model1$linear.predictors)
>>>           1           2           3           4           5           6
>>>  1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564  0.32560375
>>>
>>> You can also get the above by using the coefficients and the model
>>> matrix for comparison:
>>> # the full set of all 248 values:
>>> # coef(model1) %*% t(model.matrix(model1))
>>>> head(as.vector(coef(model1) %*% t(model.matrix(model1))))
>>> [1]  1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564  0.32560375
>>>
>>> # get fitted response values (predicted probs)
>>>> head(fitted(model1))
>>>         1         2         3         4         5         6
>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>
>>> We can also get the fitted values from the linear predictor values
>>> by using:
>>>
>>> LP <- model1$linear.predictors
>>>> head(exp(LP) / (1 + exp(LP)))
>>>        1         2         3         4         5         6
>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>
>>> You can also get the above by using the predict.glm() function with
>>> type = "response". The default type of "link" will get you the
>>> linear predictor values as above. predict.glm() would typically be
>>> used to generate predictions from the model on new data.
>>>
>>>> head(predict(model1, type = "response"))
>>>         1         2         3         4         5         6
>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
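>>>
>>> As an aside, a quick sketch of prediction on new data (the covariate
>>> values in newdat below are just made up for illustration):
>>>
>>> # three hypothetical records on the model's covariates
>>> newdat <- data.frame(spontaneous = c(0, 1, 2), induced = c(0, 1, 2))
>>> # predicted probabilities for the new records
>>> predict(model1, newdata = newdat, type = "response")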
>>>
>>> In glm.fit(), which is the workhorse function in glm(), the fitted
>>> values returned in the model object are actually computed by using
>>> the inverse link function for the family passed to glm():
>>>
>>>> binomial()$linkinv
>>> function (eta)
>>> .Call("logit_linkinv", eta, PACKAGE = "stats")
>>> <environment: 0x171c8b10>
>>>
>>> Thus:
>>>> head(binomial()$linkinv(model1$linear.predictors))
>>>         1         2         3         4         5         6
>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>
>>> Those are the same values as we saw above, so all is consistent
>>> across the various methods.
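>>>
>>> As an aside, the same inverse logit is also available as plogis(),
>>> the logistic distribution function; it should agree with the above
>>> to within numerical precision (though perhaps not bitwise, since it
>>> goes through different compiled code):
>>>
>>>> head(plogis(model1$linear.predictors))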
>>>
>>> Perhaps the above provides some insight into how R does some of
>>> these things, and also points out that, as is frequently the case
>>> with R, there is more than one way to get the same result.
>>>
>>> HTH,
>>> Marc Schwartz
>>
>> An interesting and informative reply, Marc; but I think it misses
>> the point of Fabrizio's query, which I take to be the following:
>>
>>  set.seed(54321)
>>  X <- sort(rnorm(100))
>>  a0 <- 1.0 ; b0 <- 0.5
>>  Y <- 1*(runif(100) < a0 + b0*X)
>>  GLM <- glm(Y~X,family=binomial)
>>  C <- GLM$coef
>>  C
>>  # (Intercept)           X
>>  #    2.726966    2.798429
>>  a1 <- C[1] ; b1 <- C[2]
>>  Fit0 <- GLM$fit
>>  S <- a1 + b1*X
>>  Fit1 <- exp(S)/(1+exp(S))
>>  max(abs(Fit1 - Fit0))
>>  # [1] 1.110223e-16
>>
>> This discrepancy is, of course, of the magnitude of a typical
>> rounding error. But the fact that it occurred at all indicates that
>> when glm() computed the fitted values it did not do so by taking the
>> fitted coefficients GLM$coef, forming the fitted score (linear
>> predictor) S as above, and then applying to it the "inverse link"
>> exp(S)/(1+exp(S)), since doing that would replicate the above
>> calculation and should yield identical results.
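>>
>> (As a quick sketch of confirming that this is only a tolerance-level
>> discrepancy, on a system which shows it:
>>
>>  identical(unname(Fit0), Fit1)  # FALSE: not bitwise equal
>>  all.equal(unname(Fit0), Fit1)  # TRUE: equal within tolerance
>>
>> where unname() is needed because GLM$fit carries observation names.)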
>>
>> In fact, a bit more probing of GLM shows that there is already
>> a discrepancy at the "score" level:
>>
>>  S0 <- GLM$linear.predictors
>>  max(abs(S0-S))
>>  # [1] 8.881784e-16
>>
>> so S0 has not been calculated by applying GLM$coef to X. Also, if
>> we apply the inverse link to S0, then:
>>
>>  max(abs(Fit0 - exp(S0)/(1+exp(S0))))
>>  # [1] 1.110223e-16
>>
>> which is the same discrepancy as between Fit1 and Fit0.
>>
>>  max(abs(Fit1 - exp(S0)/(1+exp(S0))))
>>  # [1] 1.110223e-16
>>
>> the same again!!
>>
>> Hence, if I have understood him aright, Fabrizio's question.
>>
>> Hoping this helps,
>> Ted.
> 
> 
> Ted,
> 
> What OS and version of R are you using?
> 
> I am on OSX 10.5.7 and using 32 bit:
> 
>    R version 2.9.1 Patched (2009-06-30 r48877)
> 
> The reason that I ask is that using your data and model, I get:
> 
>  > max(abs(Fit1 - Fit0))
> [1] 0
> 
>  >  max(abs(S0-S))
> [1] 0
> 
>  >  max(abs(Fit0 - exp(S0)/(1+exp(S0))))
> [1] 0
> 
> 
> Truth be told, I was initially leaning in the direction of a rounding  
> error, but first wanted to be sure that there was not some underlying  
> methodological error that Fabrizio was encountering. Hence the focus  
> of my reply was to present a level of consistency in getting the  
> results using multiple methods.
> 
> When I saw your post, I started thinking that my example, which did
> not have a numeric (e.g. non-integer) covariate, might have been too
> simple and missed introducing an additional source of rounding error,
> but then I could not replicate your findings either...
> 
> Thanks,
> Marc

I'm using Debian Linux:
  Linux deb 2.6.18-6-686 #1 SMP Tue May 5 00:40:20 UTC 2009 i686 GNU/Linux
(32-bit as far as I know) and R version:
  version.string R version 2.9.0 (2009-04-17)

After posting, I began to suspect that the compiled code which does
the actual computations may be working to higher precision internally,
but rounding to R's machine precision:

  $double.eps
  [1] 2.220446e-16

  $double.neg.eps
  [1] 1.110223e-16

before attaching the results to the return-list for glm() (compare
the numerical discrepancies which I found above).
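
For what it's worth, the discrepancies I reported are exact small
multiples of that machine epsilon, e.g.:

  1.110223e-16 / .Machine$double.eps
  # [1] 0.5
  8.881784e-16 / .Machine$double.eps
  # [1] 4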

However, that is only surmise!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 30-Jun-09                                       Time: 18:54:19
------------------------------ XFMail ------------------------------



