[Rd] problems with [.terms

Sat Aug 21 15:45:43 CEST 2021

The survival package uses [.terms a fair bit, in particular it makes use of the index 
returned in the 'specials' attribute, but the base R code has at least two problems.
First, it does not account for offset terms, if present, and also fails for a formula such 
as  y ~ age + (sex=='male').   Yes, the user should have used I(sex=='male'), but they 
very often don't and the formula works fine without it, but not [.terms.     Users of 
coxph regularly remind me of these flaws :-)

I first reported this in a bugzilla in 2016.   I think that I (finally) have a working fix 
for all the issues.   This is found below in an Sweave document (which is short enough 
that I've pasted it directly in) along with some further discussion.   Three particular 
questions are 1: my solution for (sex=="male") is a bit of a hack, but it works and I 
don't see anything better that is simple, and 2: we should then add documentation that 
both [.terms and drop.terms also remove any offset() terms.  (This is currently true 
anyway.) and 3. Whether this will break other code the currently works around the flaws.

If this looks good I could try to create a formal patch, though that is not something I 
have done before.

Terry T.

------------------
\documentclass [11pt]{article}
\newcommand{\code}[1]{\texttt{#1}}

\title{Subscripting a terms object}
\author{Terry Therneau}
\begin{document}
\maketitle
<<echo=FALSE>>=
options(continue=' ')
@

\section{Introduction}
The survival package makes use of the the subscripting method for terms.
Consider the following code:
<<test1>>=
library(survival)
fit1 <- coxph(Surv(time, status) ~ age + strata(inst), data=lung)
Terms <- terms(fit1)
@

In a Cox model the strata term acts as a per-institution intercept, but it
is not a part of the $X$ matrix.
There are two strategies:
\begin{itemize}
   \item If there are strata by covariate interactions, the strata columns
     are removed from the $X$ matrix after the call to model.matrix.
     They (and the intercept) are necessary to correctly form interactions.
   \item If there are no strata by covariate interactions, the strata term is
     removed from the terms object before calling model.matrix.
     Some models may have thousands of strata, a nested case-control design for
     instance, so this is important for efficiency.
\end{itemize}

The approach is to add \code{specials='strata'} to our call to \code{terms},
which causes any strata call to be marked: the \code{specials} attribute
of the terms object will then contain 3 for the example above (the specials
index counts the response),
and \code{Terms[-2]} will create a version without the strata term.

\section{Errors in [.terms}
Our problem is that the subscripting operator for a terms object fails in 2
cases.
Consider the following increasingly complex formulas:

<<case1>>=
library(survival)
f1 <- model.frame(terms(Surv(time, status) ~ ph.ecog + strata(inst) + age,
                specials="strata"), data=lung)
f2 <- model.frame(terms(Surv(time, status) ~ offset(sex) + age +
                                strata(ph.ecog), specials="strata"), data= lung)
f3 <- model.frame(terms(~ pspline(wt.loss) + (ph.ecog==1) + strata(inst) +
                             (wt.loss + meal.cal)*sex, specials="strata"), lung)
test1 <- terms(f1)
test2 <- terms(f2)
test3 <- terms(f3)
@

The root problem is the need for multiple subscripts.  Consider \code{test1}.
\begin{itemize}
   \item The object itself is a formula of length 5, with `~' as the first
     element.
   \item Attributes are
     \begin{itemize}
       \item The variables and predvars attributes are call objects, each a list()
         with 4 elments: the response and all 3 predictors.
       \item The factors attribute has 4 rows and 2 columns, with row labels
         corresponding to the \code{variables} list.
       \item The specials, response, and offset (if present) attributes give
         the index of those terms in the variables attribute.
       \item The term.labels and order attributes omit the resonse and the offset,
         so have length 2.
       \item The dataClasses attribute is a character vector of length 4.
     \end{itemize}
\end{itemize}
So the ideal result of  term1[remove the specials] would use subscript of
\begin{itemize}
   \item \code{[-5]} on the formula itself, variables and predvars attributes
   \item \code{[-2]} for term.labels
   \item \code{[-4 , -2, drop=FALSE]} for the factor attribute
   \item \code{[-2]} for order attribute
   \item \code{[-4]} for the dataClasses attribute
\end{itemize}

That will recreate the formula that ``would have been'' had there been no
strata term.  Now look at the first portion of the code in models.R

<<>>=
`[.terms` <- function (termobj, i)
{
     resp <- if (attr(termobj, "response")) termobj[[2L]]
     newformula <- attr(termobj, "term.labels")[i]
     if (length(newformula) == 0L) newformula <- "1"
     newformula <- reformulate(newformula, resp, attr(termobj, "intercept"),
                               environment(termobj))
     result <- terms(newformula, specials = names(attr(termobj, "specials")))

     # Edit the optional attributes
}
@

The use of reformulate() is a nice trick, and correctly creates all the
attributes except predvars and dataClasses.  However, the index reported in
the specials attribute is generated with reference to the variables
attribute, or equivalently the row names of factors, not with respect to the
term.labels attribute; the latter lacks the response and any offset terms.
Thus our code works for test1 but fails for test2: specials points to the wrong
variable in term.labels.

The reformulate trick breaks in another way in test3 due to the
\code{(ph.ecog==1)} term.
In the term.labels attribute the parentheses disappear, and
the result of the reformulate call is not a proper formula.  The + binds
tighter than == leading to an error message that will confuse most users.
We can argue, and I probably would, that the user should have used
\code{I(ph.ecog==1)}.
But they didn't, and without the I() it is a legal
formula, or at least one that currently works.
Fixing this issue is a harder, since only the underlying formula retains the
necessary syntax.
Short of deparsing the formula, a hack that appears to work is to surround every
term with a set of parenthesis.
@

The impact of an offset term was overlooked in second portion of the
subscript routine as well, i.e., the ``edit optional attribute'' section which
attempts to amend the predvars and dataClasses attributes.

Here is an updated subscript routine which works for the examples above.
<<newsub>>=
mytermsub <- function (termobj, i)
{
     # In the terms object for
     #   model.frame(mpg ~ cyl + offset(-.04*disp) + wt*factor(carb), mtcars)
     # The subscript 'i' in the call counts variables on the right hand side,
     #   to drop "wt" use a subscript of -3.
     # Using reformulate() with term.labels is the primary strategy, since the
     #   latter includes all the interactions.  But the offset is missing from
     #   term.labels, so we have to be more clever with indexing.
     rvar <- if (attr(termobj, "response") ==1) termobj[[2L]]
     j <- seq.int(length(attr(termobj, "term.labels")) +
                  length(attr(termobj, "offset")))

     if (!is.null(attr(termobj, "offset"))) {
         # k = where offset() would have appeared in term.labels, before removals
         k <- attr(termobj, "response") - attr(termobj, "offset")
         index1 <- match(j, j[k], nomatch=0)[i]
     } else index1 <- j[i]
     newformula <- attr(termobj, "term.labels")[index1]
     # Adding () around each term is for a formula with  + (sex=='male') in the
     #  formula.
     newformula <- reformulate(paste0("(", newformula, ")"), response= rvar,
                               intercept = attr(termobj, "intercept"),
                               env = environment(termobj))
     if (length(newformula) == 0L) newformula <- "1"

     # addition of an extra specials label causes no harm
     result <- terms(newformula, specials = names(attr(termobj, "specials")))

     # now add back the predvars and dataClasses attributes; which do contain
     # the response and offset.
     index2 <- j[i]
     if (attr(termobj, "response")==1) index2 <- c(1, index2 +1)
     # if 'i' drops an interaction it won't change predvars or dataClasses
     index2 <- index2[index2 <= length(rownames(attr(termobj, "factors")))]
     if (!is.null(attr(termobj, "offset")))
         index2 <- index2[-attr(termobj, "offset")]
     if (!is.null(attr(termobj, "predvars")))
         attr(result, "predvars") <- attr(termobj, "predvars")[c(1, index2 +1)]
     if (!is.null(attr(termobj, "dataClasses")))
         attr(result, "dataClasses") <- attr(termobj, "dataClasses")[index2]

     result
}
@

Now test this out by dropping the strata and offset terms.
<<testit>>=
f1b <- model.frame(terms(Surv(time, status) ~ ph.ecog + age,
                specials="strata"), data=lung)
f2b <- model.frame(terms(Surv(time, status) ~ age,
                                specials="strata"), data= lung)
f3b <- model.frame(terms(~ pspline(wt.loss) + (ph.ecog==1) +
                             (wt.loss + meal.cal)*sex, specials="strata"), lung)
all.equal(attributes(terms(f1b)), attributes(mytermsub(test1, -2)))
all.equal(attributes(terms(f2b)), attributes(mytermsub(test2, -3)))
all.equal(attributes(terms(f3b)), attributes(mytermsub(test3, -3)))
@
The formula itself changes due to the extra parentheses, hence the check of
only the attributes.

The drop.terms function shares much of the same code.
<<drop>>=
drop.terms <- function(termobj, dropx = NULL, keep.response = FALSE)
{
     if (is.null(dropx)) return(termobj)
     if(!inherits(termobj, "terms"))
         stop(gettextf("'termobj' must be a object of class %s",
                       dQuote("terms")), domain = NA)
     rvar <- if (keep.response & attr(termobj, "response") ==1) termobj[[2L]]
     j <- seq.int(length(attr(termobj, "term.labels")) +
                  length(attr(termobj, "offset")))

     if (!is.null(attr(termobj, "offset"))) {
         # k = where offset() would have appeared in term.labels
         k <- attr(termobj, "response") - attr(termobj, "offset")
         index1 <- match(j, j[k], nomatch=0)[-dropx]
     } else index1 <- j[-dropx]
     newformula <- attr(termobj, "term.labels")[index1]
     # Adding () around each term is for a formula with  + (sex=='male')
     newformula <- reformulate(paste0("(", newformula, ")"),
                               response = rvar,
                               intercept = attr(termobj, "intercept"),
                               env = environment(termobj))
     if (length(newformula) == 0L) newformula <- "1"

     # addition of an extra specials label causes no harm
     result <- terms(newformula, specials = names(attr(termobj, "specials")))

     # now add back the predvars and dataClasses attributes; which do contain
     # the response and offset.
     index2 <- j[-dropx]
     if (attr(termobj, "response")==1) index2 <- c(1, index2 +1)
     # if 'i' drops an interaction it won't change predvars or dataClasses
     index2 <- index2[index2 <= length(rownames(attr(termobj, "factors")))]
     if (!is.null(attr(termobj, "offset")))
         index2 <- index2[-attr(termobj, "offset")]
     if (!is.null(attr(termobj, "predvars")))
         attr(result, "predvars") <-attr(termobj, "predvars")[c(1, index2 +1)]
     if (!is.null(attr(termobj, "dataClasses")))
         attr(result, "dataClasses") <- attr(termobj, "dataClasses")[index2]

     result
}
@

\end{document}

	[[alternative HTML version deleted]]