[R] fitting in logistic model

Marc Schwartz marc_schwartz at me.com
Tue Jun 30 21:12:03 CEST 2009


On Jun 30, 2009, at 12:54 PM, Ted Harding wrote:

>
> On 30-Jun-09 17:41:20, Marc Schwartz wrote:
>> On Jun 30, 2009, at 10:44 AM, Ted Harding wrote:
>>
>>>
>>> On 30-Jun-09 14:52:20, Marc Schwartz wrote:
>>>> On Jun 30, 2009, at 4:54 AM, Renzi Fabrizio wrote:
>>>>
>>>>> I would like to know how R computes the probability of an event
>>>>> in a logistic model (P(y=1)) from the score s, the linear
>>>>> combination of x and beta.
>>>>>
>>>>> I noticed that there are differences (small, less than 1e-16)
>>>>> between the fitted values automatically computed by the glm
>>>>> procedure in R and the values I computed "manually" by applying
>>>>> the reverse formula p = e^s/(1+e^s); moreover, I noticed that the
>>>>> minimum of the fitted values from the glm fit is 2.220446e-16, and
>>>>> there are many observations with this probability (whereas the
>>>>> minimum value obtained by "manual" estimation is 2.872636e-152).
>>>>
>>>> It would be helpful to see at least a subset of the output from
>>>> your model and of your manual computations, so that we can visually
>>>> compare the results and see where the differences arise.
>>>>
>>>> The model object returned from using glm() will contain both the
>>>> linear predictors on the link scale (model$linear.predictors) and
>>>> the fitted values (model$fitted.values). The latter can be accessed
>>>> using the fitted() extractor function.
>>>>
>>>> To use an example, let's create a simple LR model using the infert
>>>> data set as referenced in ?glm.
>>>>
>>>> model1 <- glm(case ~ spontaneous + induced, data = infert,
>>>>             family = binomial())
>>>>
>>>>> model1
>>>> Call:  glm(formula = case ~ spontaneous + induced,
>>>>            family = binomial(), data = infert)
>>>>
>>>> Coefficients:
>>>> (Intercept)  spontaneous      induced
>>>>    -1.7079       1.1972       0.4181
>>>>
>>>> Degrees of Freedom: 247 Total (i.e. Null);  245 Residual
>>>> Null Deviance:            316.2
>>>> Residual Deviance: 279.6      AIC: 285.6
>>>>
>>>> # Get the coefficients
>>>>> coef(model1)
>>>> (Intercept) spontaneous     induced
>>>> -1.7078601   1.1972050   0.4181294
>>>>
>>>> # get the linear predictor values
>>>> # log odds scale for binomial glm
>>>>> head(model1$linear.predictors)
>>>>           1           2           3           4           5           6
>>>>  1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564  0.32560375
>>>>
>>>> You can also get the above by using the coefficients and the model
>>>> matrix for comparison:
>>>> # the full set of 248
>>>> # coef(model1) %*% t(model.matrix(model1))
>>>>> head(as.vector(coef(model1) %*% t(model.matrix(model1))))
>>>> [1]  1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564  0.32560375
>>>>
>>>> # get fitted response values (predicted probs)
>>>>> head(fitted(model1))
>>>>        1         2         3         4         5         6
>>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>>
>>>> We can also get the fitted values from the linear predictor values
>>>> by using:
>>>>
>>>> LP <- model1$linear.predictors
>>>>> head(exp(LP) / (1 + exp(LP)))
>>>>        1         2         3         4         5         6
>>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>>
>>>> You can also get the above by using the predict.glm() function with
>>>> type = "response". The default type of "link" will get you the
>>>> linear predictor values as above. predict.glm() would typically be
>>>> used to generate predictions from the model on new data.
>>>>
>>>>> head(predict(model1, type = "response"))
>>>>        1         2         3         4         5         6
>>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
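>>>>
>>>> As a minimal sketch of that latter use, one can predict for new
>>>> covariate combinations (the data frame newdat below is invented
>>>> purely for illustration):
>>>>
>>>> newdat <- data.frame(spontaneous = c(0, 1, 2), induced = c(0, 1, 0))
>>>>> predict(model1, newdata = newdat, type = "response")
>>>> # should give probabilities of roughly 0.153, 0.477 and 0.665,
>>>> # given the coefficients shown above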
>>>>
>>>> In glm.fit(), which is the workhorse function in glm(), the fitted
>>>> values returned in the model object are actually computed by using
>>>> the inverse link function for the family passed to glm():
>>>>
>>>>> binomial()$linkinv
>>>> function (eta)
>>>> .Call("logit_linkinv", eta, PACKAGE = "stats")
>>>> <environment: 0x171c8b10>
>>>>
>>>> Thus:
>>>>> head(binomial()$linkinv(model1$linear.predictors))
>>>>        1         2         3         4         5         6
>>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>>
>>>> Those are the same values as we saw above using the other methods,
>>>> so all is consistent across the various approaches.
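>>>>
>>>> As a quick programmatic sketch of that consistency (relying on
>>>> all.equal()'s default numeric tolerance of about 1.5e-8):
>>>>
>>>>> all.equal(fitted(model1),
>>>> +           binomial()$linkinv(model1$linear.predictors))
>>>> # should return TRUE; any rounding differences are far below
>>>> # the tolerance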
>>>>
>>>> Perhaps the above provides some insight into how R does some of
>>>> these things, and also points out that, as is frequently the case
>>>> with R, there is more than one way to get the same result.
>>>>
>>>> HTH,
>>>> Marc Schwartz
>>>
>>> An interesting and informative reply, Marc; but I think it misses
>>> the point of Fabrizio's query. I think Fabrizio's point is the
>>> following:
>>>
>>> set.seed(54321)
>>> X <- sort(rnorm(100))
>>> a0 <- 1.0 ; b0 <- 0.5
>>> Y <- 1*(runif(100) < a0 + b0*X)
>>> GLM <- glm(Y~X,family=binomial)
>>> C <- GLM$coef
>>> C
>>> # (Intercept)           X
>>> #    2.726966    2.798429
>>> a1 <- C[1] ; b1 <- C[2]
>>> Fit0 <- GLM$fit
>>> S <- a1 + b1*X
>>> Fit1 <- exp(S)/(1+exp(S))
>>> max(abs(Fit1 - Fit0))
>>> # [1] 1.110223e-16
>>>
>>> This discrepancy is of course, in magnitude, a typical "rounding
>>> error". But the fact that it occurred at all indicates that when
>>> glm() computed the fitted values, it did not do so by taking the
>>> fitted coefficients GLM$coef, creating the fitted score (linear
>>> predictor) S as above, and then applying the "inverse link"
>>> exp(S)/(1+exp(S)), since doing that would replicate the above
>>> calculation and should yield identical results.
>>>
>>> In fact, a bit more probing of GLM shows that there is already
>>> a discrepancy at the "score" level:
>>>
>>> S0 <- GLM$linear.predictors
>>> max(abs(S0-S))
>>> # [1] 8.881784e-16
>>>
>>> so S0 has not been calculated by applying GLM$coef to X. Also, if
>>> we apply the inverse link to S0, then:
>>>
>>> max(abs(Fit0 - exp(S0)/(1+exp(S0))))
>>> # [1] 1.110223e-16
>>>
>>> which is the same discrepancy as between Fit1 and Fit0.
>>>
>>> max(abs(Fit1 - exp(S0)/(1+exp(S0))))
>>> # [1] 1.110223e-16
>>>
>>> the same again!!
>>>
>>> Hence, if I have understood him aright, Fabrizio's question.
>>>
>>> Hoping this helps,
>>> Ted.
>>
>>
>> Ted,
>>
>> What OS and version of R are you using?
>>
>> I am on OSX 10.5.7, using a 32-bit build:
>>
>>   R version 2.9.1 Patched (2009-06-30 r48877)
>>
>> The reason that I ask is that using your data and model, I get:
>>
>>> max(abs(Fit1 - Fit0))
>> [1] 0
>>
>>> max(abs(S0-S))
>> [1] 0
>>
>>> max(abs(Fit0 - exp(S0)/(1+exp(S0))))
>> [1] 0
>>
>>
>> Truth be told, I was initially leaning in the direction of a rounding
>> error, but first wanted to be sure that there was not some underlying
>> methodological error that Fabrizio was encountering. Hence the focus
>> of my reply was to present a level of consistency in getting the
>> results using multiple methods.
>>
>> When I saw your post, I started thinking that my example, which did
>> not have a numeric (e.g. non-integer) covariate, might have been too
>> simple and missed introducing an additional source of rounding error,
>> but then I could not replicate your findings either...
>>
>> Thanks,
>> Marc
>
> I'm using Debian Linux:
>  Linux deb 2.6.18-6-686 #1 SMP Tue May 5 00:40:20 UTC 2009 i686 GNU/Linux
> (32-bit, as far as I know) and this R version:
>  version.string R version 2.9.0 (2009-04-17)
>
> After posting, I began to suspect that the compiled code which does
> the actual computations may be working to higher precision internally,
> but rounding to R's:
>
>  $double.eps
>  [1] 2.220446e-16
>
>  $double.neg.eps
>  [1] 1.110223e-16
>
> before attaching the results to the return-list for glm() (compare
> the numerical discrepancies which I found above).
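>
> To illustrate what $double.eps means (it is the smallest x for which
> 1 + x computes as different from 1 in double precision):
>
>  (1 + .Machine$double.eps) == 1
>  # [1] FALSE
>  (1 + .Machine$double.eps/2) == 1
>  # [1] TRUE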
>
> However, that is only surmise!
> Ted.



Well....barring a correction from someone with more low-level insight,
I would tend to suspect that the difference in our findings may be
attributed to some interaction of hardware, OS, compiler and BLAS,
perhaps weighted more toward the latter two. When I build from source,
I use the Apple vecLib BLAS (the default), rather than the R BLAS.

To distill this down to the key issue: all else being equal (within
.Machine$double.eps), I suspect that this leads us back to rounding as
the source of Fabrizio's observations.
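
As a practical check, rather than testing for exact (bitwise) equality,
one might compare the two sets of fitted values within a tolerance; a
minimal sketch using Ted's objects from his example above:

# all.equal() uses a numeric tolerance (about 1.5e-8 by default), so
# representation-level differences on the order of 1e-16 are treated
# as equal
all.equal(as.vector(Fit0), as.vector(Fit1))

# or test directly against machine epsilon
max(abs(Fit0 - Fit1)) <= .Machine$double.eps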

Marc



