[Rd] Wrongly converging glm()

Thu Jul 20 18:45:56 CEST 2017

Dear Simon,

Thanks for your response. I have a suggestion that could be non-intrusive, but still provide some additional info to the user.

The glm function already checks for collinearity of the input, and you can easily check which covariate was aliased as a result, using summary(model)$aliased.

My suggestion would therefore be to add an additional variable to the model summary, e.g. summary(model)$estimateconverged, which states whether the MLE has converged for that particular covariate.

This would provide users a way to perform sanity checks on the models they fit: sometimes you're running hundreds or millions of models, making it infeasible to check every single one of them. I agree that investigating the estimates + standard errors would be a solution, but then again, the estimate that R produces for such a covariate might as well be random.

With kind regards,

Harm-Jan

________________________________
From: Simon Bonner <sbonner6 at uwo.ca>
Sent: Thursday, July 20, 2017 12:32 PM
To: Joris Meys; Harm-Jan Westra
Cc: r-devel at r-project.org
Subject: RE: [Rd] Wrongly converging glm()

In defence of Harma-Jan's original post I would say that there is a difference between true convergence and satisfying a convergence criterion.

In my view the algorithm has not converged. This is a case of quasi-complete separate -- there are both successes and failures when x13=0 but only failures when x13=1. As a result, the likelihood has no maximum and increases, albeit slightly, as the associated coefficient tends to infinity while maximizing over the other parameters. The estimate given is not the MLE and the standard error is not meaningful because the conditions for convergence of MLEs to their asymptotic normal distribution has been violated.

I agree with Joris that someone familiar with logistic regression should be able to identify this situation -- though the solution is not as simple as throwing out the covariate. Suppose that there had been many failures when x13=1, not just 1. The same problem would arise, but x13 is clearly an important covariate. Removing it from the analysis is not the thing to do. A better solution is to penalize the likelihood or (and I'm showing my true colours here) conduct a Bayesian analysis.

Regarding the statement that the algorithm has converged, perhaps R should say more truthfully that the convergence criterion has been satisfied -- but that might lead to more confusion. In this case, any convergence criterion will be satisfied eventually. If you increase the maximum number of iterations in the C implementation then the other convergence criterion will be satisfied and the code will say that the algorithm has converged.

In the end, it's up to the analyst to be aware of the pitfalls and how to address them.

Cheers,

Simon

> -----Original Message-----
> From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Joris Meys
> Sent: July 20, 2017 11:39 AM
> To: Harm-Jan Westra <westra.harmjan at outlook.com>
> Cc: r-devel at r-project.org
> Subject: Re: [Rd] Wrongly converging glm()
>
> Allow me to chime in. That's an interesting case you present, but as far as I'm
> concerned the algorithm did converge. The estimate of -9.25 has an estimated
> standard error of 72.4, meaning that frequentists would claim the true value
> would lie anywhere between appx. -151 and 132 (CI) and hence the estimate
> from the glm algorithm is perfectly compatible with the one from the C++ code.
> And as the glm algorithm uses a different convergence rule, the algorithm
> rightfully reported it converged. It's not because another algorithm based on
> another rule doesn't converge, that the one glm uses didn't.
>
> On top of that: In both cases the huge standard error on that estimate clearly
> tells you that the estimate should not be trusted, and the fit is unstable. That's
> to be expected, given the insane inbalance in your data, especially for the 13th
> column. If my students would incorporate that variable in a generalized linear
> model and tries to formulate a conclusion based on that coefficient, they failed
> the exam. So if somebody does this analysis and tries to draw any conclusion
> whatsoever on that estimate, maybe they should leave the analysis to
> somebody who does know what they're doing.
>
> Cheers
> Joris
>
> On Thu, Jul 20, 2017 at 5:02 PM, Harm-Jan Westra
> <westra.harmjan at outlook.com
> > wrote:
>
> > Dear R-core,
> >
> >
> > I have found an edge-case where the glm function falsely concludes
> > that the model has converged. The issue is the following: my data
> > contains a number of covariates, one of these covariates has a very small
> variance.
> > For most of the rows of this covariate, the value is 0, except for one
> > of the rows, where it is 1.
> >
> >
> > The glm function correctly determines the beta and standard error
> > estimates for all other covariates.
> >
> >
> > I've placed the data here: http://www.harmjanwestra.nl/rtestdata.txt
> >
> >
> > The model I'm using is very simple:
> >
> >
> > data <- read.table("rtestdata.txt")
> >
> > model <- glm(data[,1] ~ data[,2] + data[,3] + data[,4] + data[,5] +
> > data[,6] + data[,7] + data[,8] + data[,9] + data[,10] + data[,11] +
> > data[,12] + data[,13] + data[,14], family=binomial("logit"))
> >
> > summary(model)
> >
> >
> > You will see that for covariate data[,13], the beta/coefficient
> > estimate is around -9. The number of iterations that has been
> > performed is 8, and model$converged returns TRUE.
> >
> >
> > I've used some alternate logistic regression code in C (
> > https://github.com/czep/mlelr/blob/master/src/mlelr.c), which produces
> > identical estimates for the other covariates and comparable deviance
> > values. However, using this C code, I'm seeing that the estimate for
> > data[,13] is around -100 (since I'm allowing a maximum of 100 MLE
> > iterations). There, the conclusion is that the model does not converge.
> >
> >
> > The difference between the two pieces of code is that in R, the glm()
> > function determines convergence of the whole model by measuring the
> > difference between deviance of the current iteration versus the
> > deviance of the prior iteration, and calls the model converged when it
> > reaches a certain epsilon value. In the C++ code, the model is
> > converged when all parameters haven't changed markedly compared to the
> previous iteration.
> >
> >
> > I think both approaches are valid, although the R variant (while
> > faster) makes it vulnerable to wrongly concluding convergence in edge
> > cases such as the one presented above, resulting in wrong coefficient
> > estimates. For people wanting to use logistic regression in a
> > training/prediction kind of setting, using these estimates might influence
> their predictive performance.
> >
> >
> > The problem here is that the glm function does not return any warnings
> > when one of the covariates in the model does not converge. For someone
> > who is not paying attention, this may lead them to conclude there is
> > nothing wrong with their data. In my opinion, the default behavior in
> > this case should therefore be to conclude that the model did not
> > converge, or at least to show a warning message.
> >
> >
> > Please let me know whether you believe this is an issue, and whether I
> > can provide additional information.
> >
> >
> > With kind regards,
> >
> >
> > Harm-Jan Westra
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>
>
>
> --
> Joris Meys
> Statistical consultant
>
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
>
> tel :  +32 (0)9 264 61 79
> Joris.Meys at Ugent.be
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

	[[alternative HTML version deleted]]