[R] glm.fit: fitted probabilities numerically 0 or 1 occurred?

(Ted Harding) Ted.Harding at wlandres.net
Wed Mar 21 16:35:05 CET 2012

On 21-Mar-2012 S Ellison wrote:
>> I get the errors;
>> glm.fit: fitted probabilities numerically 0 or 1 occurred and
>> glm.fit: algorithm did not converge
>> .....
>> Is there any way to to 
>> fix this problem? 
> There are two separate issues.
> One is the appearance of fitted values at 0 or 1. 
> The other is the lack of convergence.
> The first is usually not fatal; it means that the probabilities
> are so close to 0 or 1 that a double precision value can't
> distinguish them from 0 or 1.
> Often that's a transient condition during iteration and the
> final fitted values are inside (0,1), but final fitted values
> can also be out there if you have continuous predictor values
> a long way out;  by itself, that usually won't stop a glm.
> The second is a bit more problematic. Sometimes it's just that
> you need to increase the maximum number of iterations (see the
> control= argument and ?glm.control). That is always worth a try
> - use some absurdly high number like 1000 instead of the default
> 25 and go find some coffee while it thinks about it. If that
> solves your problem then you're OK, or at least as OK as you can
> be with a data set that hard to fit.
> But if you're bootstrapping with some anomalous values it is
> also possible that some of your bootstrapped sets have too high
> a proportion of anomalies, and under those conditions it's
> possible that there could be no sensible optimum within reach. 
> One way of dealing with that in a boostrap or other simulation
> context is to check the 'converged' value and if it's FALSE,
> return an NA for your statistic. But of course that is a form
> of censoring; if you have a high proportion of such instances
> you'd be on very thin ice drawing conclusions.
> S Ellison

There is a further general issue. The occurence of

  glm.fit: fitted probabilities numerically 0 or 1 occurred and
  glm.fit: algorithm did not converge

is usually a symptom of what is called "linear separation".
The simplest instance of this is when there is a single
predictor variable X whose values are, say, ordered as
X[1] < X[2] < ... < X[n-1] < X[n].

If it should happen that the values Y[1] ... Y[n] of the 0/1
response variable Y are all equal to 0 for X[1],...,X[k]
and all equal to 1 for X[k+1],...,X[n], and you are fitting
a glm(Y~X,family=binomial) (which is what you use for logistic
regression), then the formula being fitted is

  log(P(Y=1|X)/(Y=0|X)) = (X - m)/s for some values of m and s.

In the circumstances described, for any value of m between

  X[k] < m < X[k+1]

the fit can force P[Y=1|X] --> 0 for X = X[1],...,X[k]
and can force P[Y=1|X] --> 1 for X = X[k+1],...,X[n] by
pushing s --> 0 (so then the first set of (X - m)/s --> -Inf
and the second set of (X - m)/s --> +Inf), thus forcing the
probabilities predicted by the model towards agreeing exactly
with the frequencies observed in the data.

In other words, any value of m satisfying X[k] < m < X[k+1]
*separates* the Y=0 responses from the Y=1 responses: all the
Y=0 responses are for X-values less than m, and all the Y=1
responses are for values of X greater than m. Since such a
value of m is not unique (any value between X[k] and X[k+1]
will do), failure of convergence will be observed.

This can certainly arise from excessively "outlying" data,
i.e. the difference X[k+1] - X[k] is much greater than the
true value of the logistic scale parameter s, so that it
is almost certain that all responses for X <= X[k] will be 0,
and all responses for X >= X[k+1] will be 1.

But it can also readily arise when there are only a few data,
i.e. n is small; and also, even with many data, if the scale
paramater is n ot large compared with the spacing between
X values. If you run the following code a few times, you
will see that "separation" occurs with about 30% probability.

  n <- 6
  X <- (1:n) ; m <- (n+1)/2 ; s <- 2
  count <- numeric(100)
  for(i in (1:100)) {
    Y <- 1*(runif(n) <= exp((X-m)/s)/(1+exp((X-m)/s)))
    count[i] <- (max(X[Y==0]) < min(X[Y==1]))

Similarly, with n <- 10, it occurs about 20% of the time;
for n <- 15 about 15%, and it never gets much below 15%
no matter how large n is.

The more general situation of "linear separation" is where
you have many predictor variables, when there can be a high
probability that the random outcomes Y=0 label one set of
cases, the Y=1 label another set of cases, and the X-vectors
for Y=0 can be separated from the X-vectors for Y=1 by a
linear form, i.e. there is a vector of coefficients beta
such that

  X[for any Y==0]%*%beta < X[for any Y==1]%*%beta

This also can arise quite readily in practice.

There is no definitive procedure for coping with this situation
when it arises. What the fitting procedure is doing is just
what it is supposed to do, namely predicting P(Y=1)=0 for
every X such that the observed Y=0, and predicting P(Y=1)=1
for every X such that the observed Y=1 whenever it can (i.e.
whenever there is separation). So indeed this is a result
of maximum-likelihood estimation in that it is seeking values
of the parameters

There are work-rounds which are variants of penalising small
values of the scale paramater s, so that what is being
maximised is not the plain likelihood L but L moderated
by the penalty so that for the smaller values of s the
penalised likelihood will be less than the plain likelihood
(and, for really small s, much less).

However, choosing a penalising strategy is not well-defined
from a theoretical point of view, and will involve an element
of arbitrary choice, i.e. a penalisation function which
appropriately reflects your objectives.

Hoping this helps,

PS Try searching for "linear separation" and "linear separability"
in the searchable R-help archives at


(the old R Site Help and Archive Search at http://finzi.psych.upenn.edu/
is no longer of much use).

E-Mail: (Ted Harding) <Ted.Harding at wlandres.net>
Date: 21-Mar-2012  Time: 15:35:00
This message was sent by XFMail

More information about the R-help mailing list