[R] bugs and misfeatures in polr(MASS).... fixed!

Wed Nov 3 01:03:10 CET 2010

In polr.R the (several) functions gmin and fmin contain the code

>         theta <- beta[pc + 1L:q]
>         gamm <- c(-100, cumsum(c(theta[1L], exp(theta[-1L]))), 100)

That's bad. There's no reason to suppose beta[pc+1L] is larger than
-100 or that the cumulative sum is smaller than 100. For practical
datasets those assumptions are frequently violated, causing the
optimization to fail. A work-around is to center the explanatory
variables. This helps keep the zetas small.

The correct approach is to use the values -Inf and Inf as the first
and last cut points. The functions plogis, dnorm, etc all behave
correctly when the input is one of these values. The dgumbel function
does not, returning NaN for -Inf. Correct this as follows

dgumbel <- function (x, loc = 0, scale = 1, log = FALSE)
{
    x <- (x - loc)/scale
    d <- log(1/scale) - x - exp(-x)
    d[is.nan(d)] <- -Inf                # -tjb
    if (!log) exp(d) else d
}

The documentation states

>    start: initial values for the parameters.  This is in the format
>           'c(coefficients, zeta)': see the Values section. 

The relevant code is

>       s0 <- if(pc > 0) c(start[seq_len(pc+1)], diff(start[-seq_len(pc)]))
>       else c(start[1L], diff(start))

This doesn't take the logs of the differences as required to repose
the zetas into the form used in the optimization. The fix is
obvious. polr.fit has the same problem which is responsible for
summary() frequently failing when the Hessian is not provided.

I'm not convinced the t values reported by summary() are
reliable. I've noticed that a one dimensional linear transformation
the independent variables can cause the reported t values to change by
a factor of more than 100, which doesn't seem right.

        --tjb