[Rd] Wrongly converging glm()

Thu Jul 20 17:38:58 CEST 2017

Allow me to chime in. That's an interesting case you present, but as far as
I'm concerned the algorithm did converge. The estimate of -9.25 has an
estimated standard error of 72.4, meaning that frequentists would claim the
true value would lie anywhere between appx. -151 and 132 (CI) and hence the
estimate from the glm algorithm is perfectly compatible with the one from
the C++ code. And as the glm algorithm uses a different convergence rule,
the algorithm rightfully reported it converged. It's not because another
algorithm based on another rule doesn't converge, that the one glm uses
didn't.

On top of that: In both cases the huge standard error on that estimate
clearly tells you that the estimate should not be trusted, and the fit is
unstable. That's to be expected, given the insane inbalance in your data,
especially for the 13th column. If my students would incorporate that
variable in a generalized linear model and tries to formulate a conclusion
based on that coefficient, they failed the exam. So if somebody does this
analysis and tries to draw any conclusion whatsoever on that estimate,
maybe they should leave the analysis to somebody who does know what they're
doing.

Cheers
Joris

On Thu, Jul 20, 2017 at 5:02 PM, Harm-Jan Westra <westra.harmjan at outlook.com
> wrote:

> Dear R-core,
>
>
> I have found an edge-case where the glm function falsely concludes that
> the model has converged. The issue is the following: my data contains a
> number of covariates, one of these covariates has a very small variance.
> For most of the rows of this covariate, the value is 0, except for one of
> the rows, where it is 1.
>
>
> The glm function correctly determines the beta and standard error
> estimates for all other covariates.
>
>
> I've placed the data here: http://www.harmjanwestra.nl/rtestdata.txt
>
>
> The model I'm using is very simple:
>
>
> data <- read.table("rtestdata.txt")
>
> model <- glm(data[,1] ~ data[,2] + data[,3] + data[,4] + data[,5] +
> data[,6] + data[,7] + data[,8] + data[,9] + data[,10] + data[,11] +
> data[,12] + data[,13] + data[,14], family=binomial("logit"))
>
> summary(model)
>
>
> You will see that for covariate data[,13], the beta/coefficient estimate
> is around -9. The number of iterations that has been performed is 8, and
> model$converged returns TRUE.
>
>
> I've used some alternate logistic regression code in C (
> https://github.com/czep/mlelr/blob/master/src/mlelr.c), which produces
> identical estimates for the other covariates and comparable deviance
> values. However, using this C code, I'm seeing that the estimate for
> data[,13] is around -100 (since I'm allowing a maximum of 100 MLE
> iterations). There, the conclusion is that the model does not converge.
>
>
> The difference between the two pieces of code is that in R, the glm()
> function determines convergence of the whole model by measuring the
> difference between deviance of the current iteration versus the deviance of
> the prior iteration, and calls the model converged when it reaches a
> certain epsilon value. In the C++ code, the model is converged when all
> parameters haven't changed markedly compared to the previous iteration.
>
>
> I think both approaches are valid, although the R variant (while faster)
> makes it vulnerable to wrongly concluding convergence in edge cases such as
> the one presented above, resulting in wrong coefficient estimates. For
> people wanting to use logistic regression in a training/prediction kind of
> setting, using these estimates might influence their predictive performance.
>
>
> The problem here is that the glm function does not return any warnings
> when one of the covariates in the model does not converge. For someone who
> is not paying attention, this may lead them to conclude there is nothing
> wrong with their data. In my opinion, the default behavior in this case
> should therefore be to conclude that the model did not converge, or at
> least to show a warning message.
>
>
> Please let me know whether you believe this is an issue, and whether I can
> provide additional information.
>
>
> With kind regards,
>
>
> Harm-Jan Westra
>
>
>
>
>
>
>
>
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel :  +32 (0)9 264 61 79
Joris.Meys at Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

	[[alternative HTML version deleted]]