[Rd] Bug in glm.fit? (PR#1331)
   
    berwin@maths.uwa.edu.au
     
    berwin@maths.uwa.edu.au
       
    Wed, 27 Feb 2002 09:16:09 +0100 (MET)
    
    
  
G'day all,
I had a look at the GLM code of R (1.4.1) and I believe that there are
problems with the function "glm.fit" that may bite in rare
circumstances.  Note, I have no data set with which I ran into
trouble.  This report is solely based on having a look at the code.
Below I append a listing of the glm.fit function as produced by my
system.  I have added line numbers so that I can easily refer to the
code that is, in my opinion, problematic.  I also append a modified
version that, hopefully, solves these problems.  At least on
"example(glm)" both functions produce the same output (if the random
number generator is set to the same seed).
Short summary: I envisage problems with glm.fit if two conditions are
met simultaneously.  Namely, (1) the design matrix is such that the QR
algorithm used in the weighted least squares (WLS) step has to pivot
columns (lines 81-86) and (2) during the iterations the step size has
to be truncated (either lines 98-114 or lines 115-130).
More details:
In line 90, the variables "start" and "coef" are set to the
coefficients of the WLS and "start" is then permuted (line 91) to take
a possible pivoting of the QR algorithm into account.  Note that
"coef" is not permutated.   Also, since the weights in the iterative
weighted least squares algorithm change in each iteration (lines
60-139) it is theoretically possible that on some executions of this
chunk pivoting occurs and on others there is no pivoting.
Hoever, if the new coefficients are unsuitable and either the block at
lines 98-114 or lines 115-130 is entered, then "start" is modified by
shrinking it toward "coefold".  There are two problems:
1) If this happens the first time the loop in lines 60-139 is executed,
   then coefold is initialised to the value of "start" that was passed
down to the routine (line 58).  This might well be NULL (usually is?).
In this case line 105 and line 122 would produce a new "start" which
is numeric(0) and glm.fit would stop execution pretty soon thereafter.
2) If it happens during a later execution of the loop in lines 60-139,
   then "coefold" was set to the value "coef" of the prior execution
of the loop (line 137).  Depending on whether either of the code
chunks at lines 98-114 or lines 115-130 have been executed on that
previous occastion, "coefold" may or may not be permuted to the same
order as "start" in this moment.  Hence, it could happen that "start"
is shrunken towards something that doesn't make sense in line 105 or
line 122.
Relating to point 2) above.  Theoretically, during the main loop
(lines 60-139) either of the chunks in lines 98-114 or lines 115-130
could be entered.  In these chunks "coef" is set to the shrunken
"start" at the end (line 111 or line 126).  If, with the ``deviance
for the shrunken "start"'' the convergence criterion in line 131 is
fulfilled, then the main loop would finish.  In other words, at line
140 there is no way to tell whether "coef" is in the order of
"fit$coefficients" or has been permutated to have the same order as
"start".  In the later case, the additional permutation in line 157
would garble the solution.
I hope that the revised version that I append below, works around all
these potential problems without breaking anything.  As I said, I ran
"example(glm)" with the revised version and got the same results.  But
I guess you guys have far more exhaustive tests and may want to do
some further test to ensure that nothing is broken.
The revised version also makes some cosmetic changes that reflect my
preferences :-) :
1) Given the default values of the parameters "weights" and "offset" I
   was surprised about the tests in lines 17 and 19.  They seemed
superfluous.  Until I noticed that "glm" can call "glm.fit" with
actual arguments for these parameters that are NULL.  Hence, I suggest
that the default values for these arguments should be NULL.  This
makes it easier to understand the code of "glm.fit" (if one looks only
at "glm.fit" and not at all the code "glm", "glm.fit" &c
simultaneously). 
2) As mentioned above, "coefold" may be initialised to NULL in line 58
   if the parameter "start" is not specified.  (Which typically would be
the case?)  But furthermore, in lines 44-53 the test for the parameter
"start" is mangled into the test for the parameter "etastart".  If
"etastart" is not specified but a bad "start" (not the correct
length), the bad "start" would be caught.  But if a valid "etastart"
is specified, then a bad "start" would not be caught since those tests
are not executed.  (Are there users that specify both?).  Anyway, I
have untangled those sanity check.  Also to get a clean initialisation
for "coefold".
3) Given that the column names of fit$qr are (correctly) set in line
   171, I didn't see the purpose of line 155.  That line anyhow sets
the column names wrongly if pivoting has occured.  So I deleted that
line. 
That's all for now.  Thanks for the great package and I hope that this
bug report will be useful.
Cheers,
        Berwin
========================= original glm.fit =========================
  1  glm.fit <-
  2  function (x, y, weights = rep(1, nobs), start = NULL, etastart = NULL, 
  3      mustart = NULL, offset = rep(0, nobs), family = gaussian(), 
  4      control = glm.control(), intercept = TRUE) 
  5  {
  6      x <- as.matrix(x)
  7      xnames <- dimnames(x)[[2]]
  8      ynames <- names(y)
  9      conv <- FALSE
 10      nobs <- NROW(y)
 11      nvars <- NCOL(x)
 12      if (nvars == 0) {
 13          cc <- match.call()
 14          cc[[1]] <- as.name("glm.fit.null")
 15          return(eval(cc, parent.frame()))
 16      }
 17      if (is.null(weights)) 
 18          weights <- rep(1, nobs)
 19      if (is.null(offset)) 
 20          offset <- rep(0, nobs)
 21      variance <- family$variance
 22      dev.resids <- family$dev.resids
 23      aic <- family$aic
 24      linkinv <- family$linkinv
 25      mu.eta <- family$mu.eta
 26      if (!is.function(variance) || !is.function(linkinv)) 
 27          stop("illegal `family' argument")
 28      valideta <- family$valideta
 29      if (is.null(valideta)) 
 30          valideta <- function(eta) TRUE
 31      validmu <- family$validmu
 32      if (is.null(validmu)) 
 33          validmu <- function(mu) TRUE
 34      if (is.null(mustart)) {
 35          eval(family$initialize)
 36      }
 37      else {
 38          mukeep <- mustart
 39          eval(family$initialize)
 40          mustart <- mukeep
 41      }
 42      if (NCOL(y) > 1) 
 43          stop("y must be univariate unless binomial")
 44      eta <- if (!is.null(etastart) && valideta(etastart)) 
 45          etastart
 46      else if (!is.null(start)) 
 47          if (length(start) != nvars) 
 48              stop(paste("Length of start should equal", nvars, 
 49                  "and correspond to initial coefs for", deparse(xnames)))
 50          else as.vector(if (NCOL(x) == 1) 
 51              x * start
 52          else x %*% start)
 53      else family$linkfun(mustart)
 54      mu <- linkinv(eta)
 55      if (!(validmu(mu) && valideta(eta))) 
 56          stop("Can't find valid starting values: please specify some")
 57      devold <- sum(dev.resids(y, mu, weights))
 58      coefold <- start
 59      boundary <- FALSE
 60      for (iter in 1:control$maxit) {
 61          good <- weights > 0
 62          varmu <- variance(mu)[good]
 63          if (any(is.na(varmu))) 
 64              stop("NAs in V(mu)")
 65          if (any(varmu == 0)) 
 66              stop("0s in V(mu)")
 67          mu.eta.val <- mu.eta(eta)
 68          if (any(is.na(mu.eta.val[good]))) 
 69              stop("NAs in d(mu)/d(eta)")
 70          good <- (weights > 0) & (mu.eta.val != 0)
 71          if (all(!good)) {
 72              conv <- FALSE
 73              warning(paste("No observations informative at iteration", 
 74                  iter))
 75              break
 76          }
 77          z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
 78          w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
 79          ngoodobs <- as.integer(nobs - sum(!good))
 80          ncols <- as.integer(1)
 81          fit <- .Fortran("dqrls", qr = x[good, ] * w, n = as.integer(ngoodobs), 
 82              p = nvars, y = w * z, ny = ncols, tol = min(1e-07, 
 83                  control$epsilon/1000), coefficients = numeric(nvars), 
 84              residuals = numeric(ngoodobs), effects = numeric(ngoodobs), 
 85              rank = integer(1), pivot = 1:nvars, qraux = double(nvars), 
 86              work = double(2 * nvars), PACKAGE = "base")
 87          if (nobs < fit$rank) 
 88              stop(paste("X matrix has rank", fit$rank, "but only", 
 89                  nobs, "observations"))
 90          start <- coef <- fit$coefficients
 91          start[fit$pivot] <- coef
 92          eta <- drop(x %*% start)
 93          mu <- linkinv(eta <- eta + offset)
 94          dev <- sum(dev.resids(y, mu, weights))
 95          if (control$trace) 
 96              cat("Deviance =", dev, "Iterations -", iter, "\n")
 97          boundary <- FALSE
 98          if (is.na(dev) || any(is.na(coef))) {
 99              warning("Step size truncated due to divergence")
100              ii <- 1
101              while ((is.na(dev) || any(is.na(start)))) {
102                  if (ii > control$maxit) 
103                    stop("inner loop 1; can't correct step size")
104                  ii <- ii + 1
105                  start <- (start + coefold)/2
106                  eta <- drop(x %*% start)
107                  mu <- linkinv(eta <- eta + offset)
108                  dev <- sum(dev.resids(y, mu, weights))
109              }
110              boundary <- TRUE
111              coef <- start
112              if (control$trace) 
113                  cat("New Deviance =", dev, "\n")
114          }
115          if (!(valideta(eta) && validmu(mu))) {
116              warning("Step size truncated: out of bounds.")
117              ii <- 1
118              while (!(valideta(eta) && validmu(mu))) {
119                  if (ii > control$maxit) 
120                    stop("inner loop 2; can't correct step size")
121                  ii <- ii + 1
122                  start <- (start + coefold)/2
123                  mu <- linkinv(eta <- eta + offset)
124              }
125              boundary <- TRUE
126              coef <- start
127              dev <- sum(dev.resids(y, mu, weights))
128              if (control$trace) 
129                  cat("New Deviance =", dev, "\n")
130          }
131          if (abs(dev - devold)/(0.1 + abs(dev)) < control$epsilon) {
132              conv <- TRUE
133              break
134          }
135          else {
136              devold <- dev
137              coefold <- coef
138          }
139      }
140      if (!conv) 
141          warning("Algorithm did not converge")
142      if (boundary) 
143          warning("Algorithm stopped at boundary value")
144      eps <- 10 * .Machine$double.eps
145      if (family$family == "binomial") {
146          if (any(mu > 1 - eps) || any(mu < eps)) 
147              warning("fitted probabilities numerically 0 or 1 occurred")
148      }
149      if (family$family == "poisson") {
150          if (any(mu < eps)) 
151              warning("fitted rates numerically 0 occurred")
152      }
153      if (fit$rank != nvars) {
154          coef[seq(fit$rank + 1, nvars)] <- NA
155          dimnames(fit$qr) <- list(NULL, xnames)
156      }
157      coef[fit$pivot] <- coef
158      xxnames <- xnames[fit$pivot]
159      residuals <- rep(NA, nobs)
160      residuals[good] <- z - (eta - offset)[good]
161      fit$qr <- as.matrix(fit$qr)
162      nr <- min(sum(good), nvars)
163      if (nr < nvars) {
164          Rmat <- diag(nvars)
165          Rmat[1:nr, 1:nvars] <- fit$qr[1:nr, 1:nvars]
166      }
167      else Rmat <- fit$qr[1:nvars, 1:nvars]
168      Rmat <- as.matrix(Rmat)
169      Rmat[row(Rmat) > col(Rmat)] <- 0
170      names(coef) <- xnames
171      colnames(fit$qr) <- xxnames
172      dimnames(Rmat) <- list(xxnames, xxnames)
173      names(residuals) <- ynames
174      names(mu) <- ynames
175      names(eta) <- ynames
176      wt <- rep(0, nobs)
177      wt[good] <- w^2
178      names(wt) <- ynames
179      names(weights) <- ynames
180      names(y) <- ynames
181      names(fit$effects) <- c(xxnames[seq(fit$rank)], rep("", sum(good) - 
182          fit$rank))
183      wtdmu <- if (intercept) 
184          sum(weights * y)/sum(weights)
185      else linkinv(offset)
186      nulldev <- sum(dev.resids(y, wtdmu, weights))
187      n.ok <- nobs - sum(weights == 0)
188      nulldf <- n.ok - as.integer(intercept)
189      resdf <- n.ok - fit$rank
190      aic.model <- aic(y, n, mu, weights, dev) + 2 * fit$rank
191      list(coefficients = coef, residuals = residuals, fitted.values = mu, 
192          effects = fit$effects, R = Rmat, rank = fit$rank, qr = fit[c("qr", 
193              "rank", "qraux", "pivot", "tol")], family = family, 
194          linear.predictors = eta, deviance = dev, aic = aic.model, 
195          null.deviance = nulldev, iter = iter, weights = wt, prior.weights = weights, 
196          df.residual = resdf, df.null = nulldf, y = y, converged = conv, 
197          boundary = boundary)
198  }
========================= modified glm.fit =========================
glm.fit <- 
function (x, y, weights = NULL, start = NULL, etastart = NULL, 
    mustart = NULL, offset = NULL, family = gaussian(), 
    control = glm.control(), intercept = TRUE) 
{
    x <- as.matrix(x)
    xnames <- dimnames(x)[[2]]
    ynames <- names(y)
    conv <- FALSE
    nobs <- NROW(y)
    nvars <- NCOL(x)
    if (nvars == 0) {
        cc <- match.call()
        cc[[1]] <- as.name("glm.fit.null")
        return(eval(cc, parent.frame()))
    }
    if (is.null(weights)) 
        weights <- rep(1, nobs)
    if (is.null(offset)) 
        offset <- rep(0, nobs)
    variance <- family$variance
    dev.resids <- family$dev.resids
    aic <- family$aic
    linkinv <- family$linkinv
    mu.eta <- family$mu.eta
    if (!is.function(variance) || !is.function(linkinv)) 
        stop("illegal `family' argument")
    valideta <- family$valideta
    if (is.null(valideta)) 
        valideta <- function(eta) TRUE
    validmu <- family$validmu
    if (is.null(validmu)) 
        validmu <- function(mu) TRUE
    if (is.null(mustart)) {
        eval(family$initialize)
    }
    else {
        mukeep <- mustart
        eval(family$initialize)
        mustart <- mukeep
    }
    if (NCOL(y) > 1) 
        stop("y must be univariate unless binomial")
    if (!is.null(start)){
      if (length(start) != nvars) 
        stop(paste("Length of start should equal", nvars, 
                   "and correspond to initial coefs for", deparse(xnames)))
      coef <- coefold <- start  # initialise coef here in order to
                                # be able to subset it below 
    }
    else
      coef <- coefold <- rep(0,nvars)
    eta <- if (!is.null(etastart) && valideta(etastart)) 
      etastart
    else if (!is.null(start)) 
      as.vector(if (NCOL(x) == 1) 
                x * start
      else x %*% start)
    else family$linkfun(mustart)
    mu <- linkinv(eta)
    if (!(validmu(mu) && valideta(eta))) 
        stop("Can't find valid starting values: please specify some")
    devold <- sum(dev.resids(y, mu, weights))
    boundary <- FALSE
    for (iter in 1:control$maxit) {
        good <- weights > 0
        varmu <- variance(mu)[good]
        if (any(is.na(varmu))) 
            stop("NAs in V(mu)")
        if (any(varmu == 0)) 
            stop("0s in V(mu)")
        mu.eta.val <- mu.eta(eta)
        if (any(is.na(mu.eta.val[good]))) 
            stop("NAs in d(mu)/d(eta)")
        good <- (weights > 0) & (mu.eta.val != 0)
        if (all(!good)) {
            conv <- FALSE
            warning(paste("No observations informative at iteration", 
                iter))
            break
        }
        z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
        w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
        ngoodobs <- as.integer(nobs - sum(!good))
        ncols <- as.integer(1)
        fit <- .Fortran("dqrls", qr = x[good, ] * w, n = as.integer(ngoodobs), 
            p = nvars, y = w * z, ny = ncols, tol = min(1e-07, 
                control$epsilon/1000), coefficients = numeric(nvars), 
            residuals = numeric(ngoodobs), effects = numeric(ngoodobs), 
            rank = integer(1), pivot = 1:nvars, qraux = double(nvars), 
            work = double(2 * nvars), PACKAGE = "base")
        if (nobs < fit$rank) 
            stop(paste("X matrix has rank", fit$rank, "but only", 
                nobs, "observations"))
        coef[fit$pivot] <- fit$coefficients
        eta <- drop(x %*% coef)
        mu <- linkinv(eta <- eta + offset)
        dev <- sum(dev.resids(y, mu, weights))
        if (control$trace) 
            cat("Deviance =", dev, "Iterations -", iter, "\n")
        boundary <- FALSE
        if (is.na(dev) || any(is.na(coef))) {
            warning("Step size truncated due to divergence")
            ii <- 1
            while ((is.na(dev) || any(is.na(coef)))) {
                if (ii > control$maxit) 
                  stop("inner loop 1; can't correct step size")
                ii <- ii + 1
                coef <- (coef + coefold)/2
                eta <- drop(x %*% coef)
                mu <- linkinv(eta <- eta + offset)
                dev <- sum(dev.resids(y, mu, weights))
            }
            boundary <- TRUE
            if (control$trace) 
                cat("New Deviance =", dev, "\n")
        }
        if (!(valideta(eta) && validmu(mu))) {
            warning("Step size truncated: out of bounds.")
            ii <- 1
            while (!(valideta(eta) && validmu(mu))) {
                if (ii > control$maxit) 
                  stop("inner loop 2; can't correct step size")
                ii <- ii + 1
                coef <- (coef + coefold)/2
                mu <- linkinv(eta <- eta + offset)
            }
            boundary <- TRUE
            dev <- sum(dev.resids(y, mu, weights))
            if (control$trace) 
                cat("New Deviance =", dev, "\n")
        }
        if (abs(dev - devold)/(0.1 + abs(dev)) < control$epsilon) {
            conv <- TRUE
            break
        }
        else {
            devold <- dev
            coefold <- coef
        }
    }
    if (!conv) 
        warning("Algorithm did not converge")
    if (boundary) 
        warning("Algorithm stopped at boundary value")
    eps <- 10 * .Machine$double.eps
    if (family$family == "binomial") {
        if (any(mu > 1 - eps) || any(mu < eps)) 
            warning("fitted probabilities numerically 0 or 1 occurred")
    }
    if (family$family == "poisson") {
        if (any(mu < eps)) 
            warning("fitted rates numerically 0 occurred")
    }
    if (fit$rank != nvars) {
        coef[fit$pivot][seq(fit$rank + 1, nvars)] <- NA
    }
    xxnames <- xnames[fit$pivot]
    residuals <- rep(NA, nobs)
    residuals[good] <- z - (eta - offset)[good]
    fit$qr <- as.matrix(fit$qr)
    nr <- min(sum(good), nvars)
    if (nr < nvars) {
        Rmat <- diag(nvars)
        Rmat[1:nr, 1:nvars] <- fit$qr[1:nr, 1:nvars]
    }
    else Rmat <- fit$qr[1:nvars, 1:nvars]
    Rmat <- as.matrix(Rmat)
    Rmat[row(Rmat) > col(Rmat)] <- 0
    names(coef) <- xnames
    colnames(fit$qr) <- xxnames
    dimnames(Rmat) <- list(xxnames, xxnames)
    names(residuals) <- ynames
    names(mu) <- ynames
    names(eta) <- ynames
    wt <- rep(0, nobs)
    wt[good] <- w^2
    names(wt) <- ynames
    names(weights) <- ynames
    names(y) <- ynames
    names(fit$effects) <- c(xxnames[seq(fit$rank)], rep("", sum(good) - 
        fit$rank))
    wtdmu <- if (intercept) 
        sum(weights * y)/sum(weights)
    else linkinv(offset)
    nulldev <- sum(dev.resids(y, wtdmu, weights))
    n.ok <- nobs - sum(weights == 0)
    nulldf <- n.ok - as.integer(intercept)
    resdf <- n.ok - fit$rank
    aic.model <- aic(y, n, mu, weights, dev) + 2 * fit$rank
    list(coefficients = coef, residuals = residuals, fitted.values = mu, 
        effects = fit$effects, R = Rmat, rank = fit$rank, qr = fit[c("qr", 
            "rank", "qraux", "pivot", "tol")], family = family, 
        linear.predictors = eta, deviance = dev, aic = aic.model, 
        null.deviance = nulldev, iter = iter, weights = wt, prior.weights = weights, 
        df.residual = resdf, df.null = nulldf, y = y, converged = conv, 
        boundary = boundary)
}
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._