[Rd] inconsistent handling of factor, character, and logical predictors in lm()

Fri Aug 30 20:11:29 CEST 2019

Dear R-devel list members,

I've discovered an inconsistency in how lm() and similar functions handle logical predictors as opposed to factor or character predictors. An "lm" object for a model that includes factor or character predictors includes the levels of a factor or unique values of a character predictor in the $xlevels component of the object, but not the FALSE/TRUE values for a logical predictor even though the latter is treated as a factor in the fit.

For example:

------------ snip --------------

> m1 <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris)
> m1$xlevels
$Species
[1] "setosa"     "versicolor" "virginica" 

> m2 <- lm(Sepal.Length ~ Sepal.Width + as.character(Species), data=iris)
> m2$xlevels
$`as.character(Species)`
[1] "setosa"     "versicolor" "virginica" 

> m3 <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data=iris)
> m3$xlevels
named list()

> m3

Call:
lm(formula = Sepal.Length ~ Sepal.Width + I(Species == "setosa"), 
    data = iris)

Coefficients:
               (Intercept)                 Sepal.Width  I(Species == "setosa")TRUE  
                    3.5571                      0.9418                     -1.7797  

------------ snip --------------

I believe that the culprit is .getXlevels(), which makes provision for factor and character predictors but not for logical predictors:

------------ snip --------------

> .getXlevels
function (Terms, m) 
{
    xvars <- vapply(attr(Terms, "variables"), deparse2, 
        "")[-1L]
    if ((yvar <- attr(Terms, "response")) > 0) 
        xvars <- xvars[-yvar]
    if (length(xvars)) {
        xlev <- lapply(m[xvars], function(x) if (is.factor(x)) 
            levels(x)
        else if (is.character(x)) 
            levels(as.factor(x)))
        xlev[!vapply(xlev, is.null, NA)]
    }
}

------------ snip --------------

It would be simple to modify the last test in .getXlevels to 

	else if (is.character(x) || is.logical(x))

which would cause .getXlevels() to return c("FALSE", "TRUE") (assuming both values are present in the data). I'd find that sufficient, but alternatively there could be a separate test for logical predictors that returns c(FALSE, TRUE).

I discovered this issue when a function in the effects package failed for a model with a logical predictor. Although it's possible to program around the problem, I think that it would be better to handle factors, character predictors, and logical predictors consistently.

Best,
 John

--------------------------------------
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: socialsciences.mcmaster.ca/jfox/