\documentclass[nojss]{jss}
%\documentclass[article]{jssNoName}
% \VignetteIndexEntry{Exact and Asymptotic Weighted Logrank Tests for Interval Censored Data: The interval R package}
% \VignetteKeyword{Interval censoring}
% \VignetteKeyword{Hypothesis test}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% declarations for jss.cls
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% almost as usual
\author{Michael P. Fay\\National Institute of Allergy \\ and Infectious Diseases \And
        Pamela A. Shaw\\National Institute of Allergy \\ and Infectious Diseases}
\title{Exact and Asymptotic Weighted Logrank Tests for Interval Censored Data:\\ The \pkg{interval} \proglang{R} package}
%% for pretty printing and a nice hypersummary also set:
\Plainauthor{Michael P. Fay, Pamela A. Shaw} %% comma-separated
\Plaintitle{Exact and Asymptotic Weighted Logrank Tests for Interval Censored Data: The interval R package} %% without formatting
\Shorttitle{Weighted Logrank Tests for Interval Censored Data} %% a short title (if necessary)
%% an abstract and keywords
\Abstract{
This paper is a version of \cite{Fay+Shaw:2010} published in the {\it Journal of Statistical Software}, and the versions are essentially the same except for formatting and changes to the \code{initfit} defaults, which may change some timings.
For right-censored data perhaps the most commonly used tests are weighted logrank tests, such as the logrank and Wilcoxon-type tests. In this paper we review several generalizations of those weighted logrank tests to interval-censored data and present an \proglang{R} package, \pkg{interval}, to implement many of them. The \pkg{interval} package depends on the \pkg{perm} package, also presented here, which performs exact and asymptotic linear permutation tests. The \pkg{perm} package performs many of the tests included in the already available \pkg{coin} package, and provides an independent validation of \pkg{coin}. We review analysis methods for interval-censored data, and we describe and show how to use the \pkg{interval} and \pkg{perm} packages.
}
\Keywords{logrank test, Wilcoxon test, exact tests, network algorithm, \proglang{R}}
\Plainkeywords{interval-censored data, logrank test, Wilcoxon test, exact tests, network algorithm, R} %% without formatting
%% at least one keyword must be supplied
%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{13}
%% \Issue{9}
%% \Month{September}
%% \Year{2004}
%% \Submitdate{2004-09-29}
%% \Acceptdate{2004-09-29}
%% The address of (at least) one author should be given
%% in the following format:
\Address{
  Michael P. Fay \\
  National Institute of Allergy and Infectious Diseases \\
  6700B Rockledge Drive, MSC 7630 \\
  Bethesda, MD 20892-7630 \\
  E-mail: \email{mfay@niaid.nih.gov}\\
  URL: \url{http://www3.niaid.nih.gov/about/organization/dcr/BRB/staff/michael.htm} \\
  Pamela A. Shaw \\
  National Institute of Allergy and Infectious Diseases \\
  6700B Rockledge Drive, MSC 7630 \\
  Bethesda, MD 20892-7630 \\
  E-mail: \email{shawpa@niaid.nih.gov}\\
  URL: \url{http://www3.niaid.nih.gov/about/organization/dcr/BRB/staff/pam.htm}
}
%% It is also possible to add a telephone and fax number
%% before the e-mail in the following format:
%% Telephone: +43/1/31336-5053
%% Fax: +43/1/31336-734
%% for those who use Sweave please include the following line (with % symbols):
%% need no \usepackage{Sweave.sty}
%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\SweaveOpts{concordance=FALSE}
%% include your article here, just as usual
%% Note that you should use the \pkg{}, \proglang{} and \code{} commands.
%\section[About Java]{About \proglang{Java}}
%% Note: If there is markup in \(sub)section, then it has to be escape as above.
<<echo = FALSE>>=
options(prompt = "R> ", continue = "+ ")
@
\section{Introduction}

In their original paper introducing the logrank test, \cite{Peto:1972} outlined how to perform permutation-based logrank tests for interval-censored data. Later, \cite{Fink:1986} studied the application of the logrank test to interval-censored data in detail using a grouped continuous model. These logrank tests are appropriate for comparing treatment groups when the response is time to an event and the time may be known only to fall into an interval. An example is progression-free survival \citep[see e.g., ][]{Frei:2007}, where patients are monitored intermittently and progression is known to have occurred only within the time since the last visit.

A naive approach, if one has interval-censored data, is to simply impute the mid-point of the intervals and perform the usual right-censored weighted logrank tests. \cite{Law:1992} (who use the term `doubly censored observations' for interval-censored observations) studied this naive approach and showed through simulation that when the censoring mechanism is different between the two groups, the type I error of a nominal $0.05$ test was in one case estimated to be as high as $0.19$. Thus, there is clearly a need for tests specifically designed for interval-censored data.

Despite the demonstrated need for these methods and despite the development of several methods for generalizing the logrank test to interval-censored data, these tests are rarely used in the medical literature. This may be due to lack of knowledge about the interval-censored methods, but it also may be due to the lack of widely available software to test interval-censored data nonparametrically. In this paper we describe an \proglang{R} package, called \pkg{interval}, to perform weighted logrank tests (WLRT) (including a generalization of the Wilcoxon-Mann-Whitney test) for interval-censored data. The open source \proglang{R} environment \citep{R} easily accommodates packages, and all \proglang{R} packages mentioned in this paper are available from the Comprehensive \proglang{R} Archive Network at \url{http://CRAN.R-project.org/}. For each type of rank score (e.g., logrank-type or Wilcoxon-type) the \pkg{interval} package allows three methods for creating tests: (1) score tests, (2) permutation tests derived from the score statistics in the grouped continuous model (GCM), both described in \cite{Fay:1999}, and (3) multiple imputation tests as described in \cite{Huan:2008}.
The $p$-values from the permutation tests may be calculated either asymptotically using a permutational central limit theorem or exactly using complete enumeration, a network algorithm (for the two-sample case only), or Monte Carlo resampling. We know of no readily available software to perform these tests (or other generalizations of the WLRTs to interval-censored data), except for the \proglang{S-PLUS} functions (written by the first author) upon which the \pkg{interval} package is based \citep{Fayinterval}.

In Section~\ref{sec-background} we give a brief background on the analysis of interval-censored data, with emphasis on nonparametric maximum likelihood estimation of the distribution and generalizations of the WLRTs for interval-censored data. We review different forms of the likelihood as well as the different methods for carrying out the WLR tests. This section reviews what is known about the asymptotic validity of these tests for different interval-censoring settings. With this background, we hope to give intuition on the methods for users primarily familiar with the usual right-censored logrank tests and also provide information to guide the applied researcher to make an appropriate choice of test for their setting of interest. The mathematical formulation of the WLRTs used in the \pkg{interval} package is outlined in Section~\ref{sec-details}. Section~\ref{sec-application} provides step-by-step instructions on how to use the \pkg{interval} package to perform data analysis. This application section demonstrates the use of the two main functions of \pkg{interval}: \code{icfit}, which provides nonparametric maximum likelihood estimators (NPMLEs) of the survival distribution, and \code{ictest}, which performs weighted logrank tests. As explained in Section~\ref{sec-background}, all implementations of the WLRTs require first fitting an NPMLE of the overall survival distribution and, for some kinds of inferences, permutation methods may be used. Because of this latter fact, the \pkg{interval} package depends on the \pkg{perm} package, which performs exact and asymptotic linear permutation tests. Note that this package is mostly redundant because nearly all of the tests performed by \pkg{perm} are available in the package \pkg{coin} \citep{Hoth:2006,coin2}, and the exact tests are generally faster in \pkg{coin}. The redundancy provides independent software to check the results from \pkg{perm}. In Section~\ref{sec-perm} we go into detail about the design of the \pkg{perm} package and how it differs from the \pkg{coin} package.

In Section~\ref{sec-interval} we detail the design of the \pkg{interval} package. The \pkg{interval} package uses an expectation-maximization (EM) style algorithm to find the NPMLE of the overall distribution, where the initial estimator of the distribution may come from a function from another package. The default initial estimate comes from a function whose main calculation engine is the \code{computeMLE} function from \pkg{MLEcens}, an available package for bivariate interval-censored data that is also applicable for univariate interval-censored data \citep{MLEcens}. Additionally, the \pkg{Icens} package may be used to calculate an initial distribution estimate, and we have designed the \pkg{interval} package to be able to input NPMLEs from the \pkg{Icens} package \citep{Icens}, using the class structure designed there.
Further, we show how the \code{wlr_trafo} function of the \pkg{interval} package may be called seamlessly from within \pkg{coin} to perform exact permutation tests on interval-censored data, possibly more quickly than using the default \pkg{perm} package.

\section{Background on interval-censored data analysis methods}
\label{sec-background}

\subsection{Censoring assumptions}
\label{sec-censoring.assumptions}

Interval-censored data arise when the response of interest is a time-to-event variable, but instead of the response time, we only observe whether or not the event has yet occurred at a series of assessment times. For example, suppose the event is time until first disease occurrence, and we can assume that this disease will not spontaneously be cured without treatment (e.g., HIV). The response data may be a series of assessment times on the individual corresponding to times when blood samples used for detecting the disease were taken, and at each blood sample we determine whether the disease is detected or not. Typically, we do not keep information about all the assessment times, but only record the last disease-free assessment time and the first assessment time where the disease was detected. When all the assessment times (not just the two that we keep information about) are independent of the event time, we say that the censoring is non-informative. In the example, the censoring is non-informative if the subjects are not more or less likely to get a blood sample taken if they have the disease. More difficult situations where the assessment times may depend on the event can cause informative censoring and are not discussed in this paper or handled by the software described here. See \cite{Zhan:2007} and the references therein for methods for informative interval censoring.

\subsection{Nonparametric estimation of survival distributions}
\label{sec-npmle}

With right-censored data we often plot the NPMLE of the survival function, which in that case is called the Kaplan-Meier or product-limit estimator \citep[see][]{Kalb:1980}. The Kaplan-Meier estimator is typically called ``undefined'' after the last observation if that observation is right-censored. This is because the NPMLE is not unique in that region, as changes in the survival distribution after that last censored observation do not affect the likelihood of the observed data. A similar and much more common non-uniqueness problem occurs with interval-censored data. Consider a simple case where every subject is assessed at day 0, day 7, day 14, and so on, every 7 days until the study ends at 20 weeks, and suppose that at least one failure is observed every week. Then no information about the survival time between days 0 and 7 is available, and if two survival functions match at days 0, 7, 14, etc., they will give the same likelihood value. For this case, we represent the class of NPMLEs of the survival function by the probability mass associated with the intervals $(0, 7]$, $(7, 14]$, $\ldots$, $(133, 140]$. This class does not represent one unique survival function, but any survival function within the class will maximize the likelihood. The \pkg{interval} package represents the class of NPMLEs of the survival distribution by putting gray rectangles in the area where the function is not unique (see Section~\ref{sec-application}). This type of non-uniqueness is called {\it representational non-uniqueness} by \cite{Gent:2002}.
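To make the weekly-assessment example concrete, the following sketch (not run) simulates such data and fits the NPMLE with the \code{icfit} function described in Section~\ref{sec-application}; the sample size, event-time distribution, and administrative censoring at day 140 are chosen only for illustration. The fitted object should place all of its probability mass on intervals of the form $(7(j-1), 7j]$, within which the NPMLE is not unique.
<<eval = FALSE>>=
## Illustrative sketch (not run): weekly assessments at days 7, 14, ..., 140.
## Only the interval between the last negative assessment and the first
## positive assessment is recorded; the event time itself is never observed.
library("interval")
set.seed(1)
event <- rexp(50, rate = 1/40)     # hypothetical unobserved event times
R <- 7 * ceiling(event / 7)        # first assessment at or after the event
L <- R - 7                         # last assessment before the event
cens <- event > 140                # administratively censored at 20 weeks
L[cens] <- 140
R[cens] <- Inf
summary(icfit(L, R))
@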
For ease of exposition (and in line with most of the literature) we will call the class of NPMLEs of the distribution `the' NPMLE of the distribution. For studies with irregular assessment times between individuals, we also represent the NPMLE of the survival distribution as a vector of probability masses and a set of intervals on which the masses are distributed; however, the determination of that set of intervals is not as obvious. \cite{Turn:1976} showed that the NPMLE of the survival distribution is only unique up to a set of intervals, which may be called the innermost intervals \citep[these intervals are known also as the Turnbull intervals, or the regions of the maximal cliques, see ][]{Gent:2001}. \cite{Turn:1976} showed how the NPMLE can then be found by the self-consistent algorithm, which is a special case of the EM algorithm. Convergence of the EM algorithm does not guarantee convergence to the global maximum \citep[see e.g., ][]{Tann:1996}; however, \cite{Gent:1994} showed that for interval-censored data, a self-consistent estimate (i.e., an estimate that results at convergence of the self-consistent algorithm) is the NPMLE if it meets certain conditions, the Kuhn-Tucker conditions. \cite{Gent:1994} proposed a ``polishing algorithm'' whereby we provisionally set the estimated probability mass in some intervals to zero if they are below some bound, then check the Kuhn-Tucker conditions to make sure that those values are truly zeros at the NPMLE. If those conditions are not met, then a small probability is added back on and the EM iterations continue. Convergence may be defined when the maximum reduced gradient is less than some minimum error, and the Kuhn-Tucker conditions are met \citep[see][]{Gent:1994}. For univariate interval-censored data, once the innermost intervals (i.e., regions of the maximal cliques) are determined, the probability assigned to those intervals which maximizes the likelihood is unique \citep[see e.g., ][Lemma 4]{Gent:2002}. For bivariate interval-censored data, this uniqueness of probabilities may not hold, and that situation is called {\it mixture non-uniqueness} \citep[see][]{Gent:2002}.

There are many other algorithms for calculating the NPMLE, and the \pkg{Icens} package provides many of the different algorithms described in \cite{Gent:2001}. The default calculation of the NPMLE in the \pkg{interval} package uses the EM algorithm with the polishing algorithm of \cite{Gent:1994} after calculating an initial estimator of the NPMLE using the \code{computeMLE} function of the \pkg{MLEcens} package. Although \pkg{MLEcens} was designed for bivariate interval-censored data, once the data are reduced to the set of maximal cliques, the calculation is the same as for the univariate interval-censored case. The optimization step (going from the set of maximal cliques to the NPMLE) in \pkg{MLEcens} uses the support reduction algorithm of \cite{Groe:2008}. The advantage of the initial estimator is speed, and for completeness in \pkg{interval} the Kuhn-Tucker conditions are checked.

\subsection{Overview of weighted logrank tests}
\label{sec-app.overview}

For right-censored data, the logrank test is a score test for the equality of survival distributions under the proportional hazards model; thus, it is an efficient test when the proportional hazards assumption holds. There are several different versions of the logrank test that have been developed \citep[see ][]{Kalb:1980}.
In particular, the likelihood could be the marginal likelihood of the ranks, the partial likelihood, or the likelihood under the grouped continuous model. Further, the variance could be estimated by the Fisher information from the likelihood, by martingale methods \citep[see ][]{Flem:1991}, or by permutation methods. The differences between the several versions of the logrank test are often not a focus of applied statisticians; however, since we are emphasizing validation of software in this paper, these slight differences need to be considered to avoid confusion, and they will be discussed in detail in later sections \citep[see e.g., ][]{Call:2003}.

In addition to the logrank test, which is a WLRT with constant weight of 1 (or approximately 1), an important WLRT is the one that generalizes the Wilcoxon-Mann-Whitney test. We will call these latter tests Wilcoxon-type tests, but they are known by other names (e.g., Prentice-Wilcoxon test, proportional odds model, Harrington-Fleming $G^{\rho}$ class with $\rho = 1$). Similar to the logrank test, the Wilcoxon-type tests have also been derived using different likelihoods and using different variances. The important point for the applied researcher is that the Wilcoxon-type tests emphasize early differences in distributions (when there are more people at risk) more than the later differences (when there are fewer people at risk), while the logrank test gives constant (or near constant) weights when the test is written in the weighted logrank form (see Equation~\ref{eq:weighted.logrank}), which places comparatively more weight on later differences than the Wilcoxon-type test does.

We now summarize the next two subsections, which detail the extension of logrank tests to interval-censored data. Both likelihoods that may be applied to interval-censored data, the likelihood under the grouped continuous model (LGCM) and the marginal likelihood of the ranks (MLR), should give similar answers. The permutation form of the tests is generally preferred over the score test forms when using the LGCM, since permuting allows exact inference when the censoring is not related to the covariate (e.g., treatment), and the permutation results avoid theoretical problems of the score test \citep[see below and ][]{Fay:1996}. When the censoring is related to treatment and there are few inspection times compared to the number of subjects, the usual score test is recommended since it is asymptotically valid in this case. Now we give some more details on the different tests for interval-censored data.

\subsection{Choosing the likelihood for WLR tests}

There has not been a single obvious approach for creating a likelihood to use for interval-censored data. \cite{Self:1986} used the marginal likelihood of the ranks (MLR). This has the advantage that we need not estimate the baseline distribution (or equivalently the baseline hazard). The disadvantage of the MLR is that it is difficult to calculate. Note that even in the right-censored case with ties, the likelihood is usually only approximated \citep[see][pp.~74--75]{Kalb:1980}. \cite{Satt:1996} introduced a stochastic approximation to the MLR using Gibbs sampling for the proportional hazards model, and it was generalized to proportional odds and other models by \cite{Gu:2005}. \cite{Fink:1986} \citep[see also][]{Fay:1996, Fay:1999} used the likelihood under the grouped continuous model (LGCM).
In the LGCM, we estimate a baseline distribution, which is a monotonic function evaluated at each observation point, and the function's value at each of those points is a nuisance parameter that must be estimated. Because there are so many nuisance parameters and the number of them may depend on the sample size, the standard likelihood-based tests (i.e., score test, Wald test, and likelihood ratio test) may not follow the usual theory \citep[see][]{Fay:1996}. Note, however, that the permutation test formed from the scores of the LGCM is theoretically justified and is known to be a valid test when the censoring is unrelated to the covariate (see the following section). We discuss the computational issues of the LGCM in the next section.

For the non-censored case, \cite{Pett:1984} studied the two likelihoods and showed that both likelihoods give asymptotically equivalent score tests as long as either the number of categories of response is fixed, or the number of categories does not increase too quickly compared to the total sample size. Pettitt concluded \citep[see][section 5.1]{Pett:1984} that the score test for the MLR was more efficient (i.e., had greater power) than the score test for the LGCM; however, Pettitt did not consider the permutation form of the test using the LGCM. Finally, when imputation methods are applied, martingale methods may be used \citep[see][and below]{Huan:2008}. The \pkg{interval} package allows the user to choose between the LGCM and imputation/martingale likelihood methods through the \code{method} option within \code{ictest}, as will be demonstrated in Section~\ref{sec-application}. The MLR is not supported within the \pkg{interval} package at this time.

\subsection{Choosing the inferential method for WLR tests}
\label{sec-inferential.method}

Once the likelihood is chosen and the associated efficient score (the first derivative of the log-likelihood with respect to the parameter of interest evaluated under the null, i.e., the $U$ in Equation~\ref{eq:perm} below) is calculated, the distribution of that score under the null must be estimated so that the $p$-value corresponding to the test statistic can be calculated. There are several methods for doing this, but the three most common are: using asymptotic methods with the observed Fisher information (commonly known as the score test), using permutation methods, or using multiple imputation \citep{Huan:2008}.

When the censoring mechanism is the same for all treatment groups, the permutation test is known to be valid for either the MLR or the LGCM. In this case of equal censoring, the score test is only known to be asymptotically valid using the MLR; using the LGCM we require the additional assumption that the number of observation times remains fixed as the sample size goes to infinity \citep[see][for a discussion of this issue]{Fay:1996}.
%\cite{Brow:1984} has studied in the right-censored case the choice
%between the Mantel-Haenszel variance
%(I now do not think this is equivalent to the score variance, although maybe it is approximately so) and the permutation test.
%\cite{Brow:1984} shows that for equal sample sizes and when the score is large in absolute value
%that the score variance
%tends to underestimate the variance. Further, when the sample sizes
%are unequal the permutation variance tends to overestimate the
%variance. Because of these two facts, \cite{Brow:1984} recommends the
%permutation variance.
When there is unequal censoring, the theory for the permutation method is not formally met.
Thus, we have previously suggested that with unequal censoring the score variance is better \citep[see e.g., ][p.~820 for the interval-censoring case]{Fay:1996}. Further work needs to be done to explore unequal interval-censoring; however, we can get some assurance for the practical use of the permutation method from the special case of right-censored data, where it has been shown through simulation that the permutation method usually controls the type I error except in extreme cases of unequal censoring and very unbalanced sample sizes between groups \citep{Hein:2003}. \cite{Hein:2003} also developed an imputation method that controlled the simulated type I error in all cases, and we discuss other related imputation methods applied to interval-censored data next.

Another strategy to create WLRTs for interval-censored data is to impute right-censored data from the interval-censored data and then properly adjust the variance. \cite{Huan:2008} improved on some earlier attempts at this variance adjustment after imputation. This appears to be a reasonable strategy, and provides an independent check on the other methods. Since this imputation method is closely related to the within-cluster resampling of \cite{Hoff:2001}, we use `wsr' (for within-subject resampling) to label these methods in the \pkg{interval} package. On each imputation, \cite{Huan:2008} considered only the usual martingale-derived variance (use \code{method = "wsr.HLY"} in \code{ictest}), while the \pkg{interval} package additionally allows for permutational variance (\code{method = "wsr.pclt"}) and Monte Carlo estimation within each imputation (\code{method = "wsr.mc"}).

\subsection{Regression in interval-censored data}

This section is provided for completeness, but these methods are not a part of the \pkg{interval} package. For parametric methods, it is straightforward to form the likelihood for interval-censored data under the accelerated failure time model, and standard likelihood-based methods may be applied (see Equation~\ref{eq:aft}). These methods are provided in the \pkg{survival} package using the \code{survreg} function \citep{survival}. For right-censored data a more common regression method is the semi-parametric Cox proportional hazards regression. In this model the baseline hazard function is completely nonparametric, but does not need to be estimated. The score test from this model is the logrank test. The generalization of the model to interval-censored data typically uses the marginal likelihood of the ranks \citep[see][]{Satt:1996, Gogg:1999}. The only available software of which we are aware for fitting these models is an \proglang{S} function (which calls a compiled \proglang{C} program requiring access to a SPARC-based workstation) to perform a Monte Carlo EM algorithm for proportional hazards models described in \cite{Gogg:1999} \citep{SGogg}. Another approach to semi-parametric modeling is to specifically estimate the nonparametric part of the model with a piecewise constant intensity model \citep[see][]{Farr:1996, Cars:1996}. This is the approach taken with the \code{Icens} function in the \pkg{Epi} package \citep{Epi}.
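As a brief illustration of the parametric approach, the following sketch (not run) fits a Weibull accelerated failure time model with \code{survreg} to the breast cosmesis data analyzed in Section~\ref{sec-application}; the Weibull distribution is chosen only for illustration. Because \code{Surv} with \code{type = "interval2"} uses \code{NA} (rather than the 0 and \code{Inf} codes used by the \pkg{interval} package) to indicate left- and right-censored endpoints, the endpoints are recoded before fitting.
<<eval = FALSE>>=
## Illustrative sketch (not run): parametric Weibull AFT fit for the
## interval-censored breast cosmesis data using survreg().
library("survival")
data("bcos", package = "interval")
bcos2 <- bcos
bcos2$left[bcos2$left == 0] <- NA            # (0, R] treated as left-censored at R
bcos2$right[!is.finite(bcos2$right)] <- NA   # (L, Inf) treated as right-censored at L
survreg(Surv(left, right, type = "interval2") ~ treatment,
        data = bcos2, dist = "weibull")
@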
\section{Mathematical formulation of the scores for the WLRT}
\label{sec-details}

In this section, we provide the general form of the rank-invariant score test under the grouped continuous model, and for each of the three main rank scores available within \code{ictest}: those from the logistic \citep{Sun:1996}, the grouped proportional hazards \citep{Fink:1986}, and the generalized Wilcoxon-Mann-Whitney \citep{Fay:1996} models. In the details that follow, we briefly describe the underlying survival model (or hazard model) and the mathematical form of the individual scores. Further details on the derivation of the tests are given in \cite{Fay:1996} and \cite{Fay:1999}. The other rank scores available in \pkg{interval} are also described briefly.

Suppose we have $n$ subjects. For the $i$th subject, use the following notation:
\begin{description}
\item[$x_i$] is the time to event, and $X_i$ is the associated random variable.
\item[$L_i$] is the largest observation time before the event is known to have occurred.
\item[$R_i$] is the smallest observation time at or after the event is known to have occurred. In other words, we know that $x_i \in (L_i, R_i]$. (Note that \pkg{interval} allows the endpoints of each interval to be either included or excluded using the \code{Lin} and \code{Rin} options, but for ease of explanation we assume the pattern just described.) We allow $R_i = \infty$ to denote right-censoring.
%For exact event times, i.e., $L_i = \lim_{\epsilon \rightarrow 0} R_i - \epsilon \equiv R_i-0$, we enter
%those values into the software using $L_i=R_i$.
\item[$z_i$] is a $k \times 1$ vector of covariates.
\end{description}
Let the ordered potential observation times be $0=t_0 < t_1 < t_2 < \cdots < t_m < t_{m+1} = \infty$. Partition the sample space by creating $(m+1)$ intervals, with the $j$th interval denoted $I_j \equiv (t_{j-1}, t_j]$. For simplicity, we assume that $L_{i}, R_{i} \in \{ t_{0}, \ldots, t_{m+1} \}$. Let
\begin{eqnarray*}
\alpha_{ij} & = & \left\{
\begin{array}{cc}
1 & \mbox{ if } L_{i} < t_{j} \leq R_{i} \\
0 & \mbox{ otherwise }
\end{array} \right.
\end{eqnarray*}
We write the general model of survival for the $i$th individual as
\begin{eqnarray*}
Pr ( X_i > t_{j} | z_{i} ) & = & S(t_{j} | z_{i}^{\top} \beta, \gamma)
\end{eqnarray*}
where $\beta$ is a $k \times 1$ vector of treatment parameters, and $\gamma$ is an $m \times 1$ vector of nuisance parameters for the unknown survival function.

In the \pkg{interval} package, there are several different ways we can model $S(t_{j} | z_{i}^{\top} \beta, \gamma)$. Most of these ways use a model closely related to the accelerated failure time (AFT) model. Thus, we begin by defining the AFT model, where the time to event for the $i$th subject, $X_i$, is modeled as
\begin{eqnarray}
\log(X_i) & = & \mu +z_{i}^{\top} \beta + \sigma \epsilon_i \label{eq:aft}
\end{eqnarray}
where $\mu$ and $\sigma$ are location and scale parameters and the distribution of $\epsilon_i$ is known to be $F$. In the grouped continuous model, we replace the $\log$ transformation with $g(\cdot)$, an unknown monotonic transformation that absorbs the $\mu$ and $\sigma$ parameters, to get:
\begin{eqnarray}
g(X_i) & = & z_{i}^{\top} \beta + \epsilon_i \label{eq:g}
\end{eqnarray}
where here also $\epsilon_i \sim F$ and $F$ is some known distribution (e.g., logistic, normal).
Then the model of the survival distribution at time $t$ is
\begin{eqnarray}
S(t | z_{i}^{\top} \beta, \gamma) & = & 1- F\{ g(t) - z_{i}^{\top} \beta \} \label{eq:Sgcm}
\end{eqnarray}
and in the grouped continuous model, $g(\cdot)$ is described at all the places where the likelihood may change (i.e., at $t_1, \ldots, t_m$) by the vector of nuisance parameters, $\gamma$. The grouped continuous likelihood for interval-censored data is
\begin{eqnarray}
L & = & \prod_{i=1}^{n} \sum_{j=1}^{m+1} \alpha_{ij} \left[ S(t_{j-1} | z_{i}^{\top} \beta, \gamma ) - S(t_{j} | z_{i}^{\top} \beta, \gamma) \right] = \prod_{i=1}^{n} \left[ S(L_{i} | z_{i}^{\top} \beta, \gamma ) - S(R_{i} | z_{i}^{\top} \beta, \gamma) \right]. \label{eq:gc.likelihood}
\end{eqnarray}
To form the score statistic we take the derivative of $\log(L)$ with respect to $\beta$ and evaluate it at $\beta=0$. The MLEs of the nuisance parameters when $\beta=0$ (in terms of the baseline survival) are the NPMLE of survival, $\hat{S}(t_{j}), j=1, \ldots, m$. For convenience, let $\hat{S}(t_{0}) =1$ and $\hat{S}(t_{m+1}) = 0$, even though these values are known by assumption. We can write the efficient score vector for the parameter $\beta$ \citep[see][]{Fay:1996, Fay:1999} as
\begin{eqnarray}
U & = & \sum_{i=1}^{n} z_{i} \left( \frac{ \hat{S}'(L_{i}) - \hat{S}'(R_{i}) }{ \hat{S}(L_{i}) - \hat{S}(R_{i}) } \right) \nonumber \\
 & \equiv & \sum_{i=1}^{n} z_{i} c_{i} \label{eq:perm}
\end{eqnarray}
where $\hat{S}'(t)$ is the derivative with respect to $\beta$ evaluated at $\beta=0$ and at $g(t)=F^{-1}(1-\hat{S}(t))$, i.e., $\hat{S}'(t)=f \left[ F^{-1} \left\{ 1-\hat{S}(t) \right\} \right]$, where $f$ and $F^{-1}$ are the density and quantile functions of $F$, respectively. When $z_{i}$ is a $k \times 1$ vector of indicators of $k$ treatments, we can rewrite the $\ell$th row of $U$ as
\begin{eqnarray}
U_{\ell} & = & \sum_{j=1}^{m} w_{j} \left[d_{j \ell}^{*} - \frac{ n_{j\ell}^{*} d_{j}^{*} }{ n_{j}^{*} } \right], \label{eq:weighted.logrank}
\end{eqnarray}
where
\begin{eqnarray*}
w_{j} & = & \frac{ \hat{S}(t_{j}) \hat{S}'(t_{j-1}) - \hat{S} (t_{j-1}) \hat{S}'(t_{j}) }{ \hat{S}(t_{j}) \left[ \hat{S}(t_{j-1}) - \hat{S}(t_{j} ) \right] },
\end{eqnarray*}
and $d_{j\ell}^{*}$ represents the expected value under the null of the number of deaths in $I_{j}$ for the $\ell$th treatment group, $d_{j}^{*}$ represents the expected value under the null of the total number of deaths in $I_j$, and similarly $n_{j \ell}^{*}$ and $n_{j}^{*}$ represent the corresponding expected numbers at risk.

We now give the values for $c_i$ (from Equation~\ref{eq:perm}) and $w_j$ (from Equation~\ref{eq:weighted.logrank}) for some different survival models provided in \code{ictest}. Although not developed first, we present the model of \cite{Sun:1996} first because it is the generalization of the logrank test most commonly used for right-censored data.

\cite{Sun:1996} modeled the odds of discrete hazards as proportional to $\exp(z_{i}^{\top}\beta)$ \citep[see][]{Fay:1999}, leading to the more complicated survival function:
\begin{eqnarray*}
\label{SunScore}
S(t_j | z_i, \gamma) & = & \prod_{k=1}^{j} \left\{ 1 + \left( \frac{S(t_{k-1}|\gamma) -S(t_k|\gamma)}{S(t_k|\gamma)} \right) \exp( z_{i}^{\top} \beta ) \right\}^{-1}.
\end{eqnarray*}
Here and in the other two models, $S(t_{j} | \gamma)$ is an estimator of survival that does not depend on the covariates $z_i$, and $S(t_{j} | \gamma)$ is nonparametric because $\gamma$ is $m \times 1$ and there are only $m$ unique time points observed in the data. Denote its estimator $S(t|\hat{\gamma}) \equiv \hat{S}(t)$, which is the NPMLE of the survival function of all the data ignoring covariates. Under the model of \cite{Sun:1996} we get
\begin{eqnarray}
c_{i} & = & \frac{ \hat{S}(L_i) \log \tilde{S}(L_i) - \hat{S}(R_i) \log \tilde{S}(R_i) }{ \hat{S}(L_i) - \hat{S}(R_i) } \label{eq:ci.Logistic}
\end{eqnarray}
where $\tilde{S}(t_j) = \exp \left( - \sum_{\ell=1}^{j} \hat{\lambda}_{\ell} \right)$ and $\hat{\lambda}_{\ell} = \left\{ \hat{S}(t_{\ell-1}) - \hat{S}(t_{\ell}) \right\} / \hat{S}(t_{\ell-1})$, and
\begin{eqnarray*}
w_{j} & = & 1.
\end{eqnarray*}
This model is called from the \pkg{interval} package by the option \code{scores = "logrank1"}.

The second model we consider was actually developed first: it is the grouped proportional hazards model introduced by \cite{Fink:1986}, where the survival function is modeled as $S(t_{j} | z_{i}^{\top} \beta, \gamma) = S(t_{j} | \gamma)^{\exp(z_{i}^{\top} \beta)}$. This comes from the model of Equation~\ref{eq:g} when $F$ is the extreme minimum value distribution. Under this grouped proportional hazards model, the $c_i$ values are:
\begin{eqnarray}
c_{i} & = & \left\{
\begin{array}{cc}
\frac{ \hat{S}(L_i) \log \hat{S}(L_i) - \hat{S}(R_i) \log \hat{S}(R_i) }{ \hat{S}(L_i) - \hat{S}(R_i) } & \mbox{ for $R_{i} < t_{m+1}$ } \\
 & \mbox{ } \\
\log \hat{S}(L_i) & \mbox{ for $R_{i} = t_{m+1} \equiv \infty$ }
\end{array} \right. \label{eq:ci.GPH}
\end{eqnarray}
and
\begin{eqnarray*}
w_{j} & = & \frac{ \hat{S}(t_{j-1}) \left[ \log \hat{S}(t_{j-1}) - \log \hat{S}(t_{j}) \right] }{\hat{S}(t_{j-1}) - \hat{S}(t_{j})}.
\end{eqnarray*}
Note that because this model makes a proportional hazards assumption, we call the resulting test a logrank test also, and the model is called by \code{scores = "logrank2"} in the \pkg{interval} package. When $\hat{S}(t_{j-1})/\hat{S}(t_j) \approx 1$ then $w_j \approx 1$.

Next, consider the model proposed by \cite{Peto:1972} \citep[see][]{Fay:1996} giving the Wilcoxon-type test, where the odds are proportional to $\exp( - z_i^{\top} \beta)$ so that the survival function is
\begin{eqnarray*}
S(t_j | z_i, \gamma) & = & \left\{ 1 + \left( \frac{ 1 - S(t_j| \gamma)}{S(t_j|\gamma)} \right) \exp( z_i^{\top} \beta ) \right\}^{-1}
\end{eqnarray*}
and we get
\begin{eqnarray*}
c_{i} & = & \hat{S}(L_{i}) + \hat{S}(R_{i}) - 1
\end{eqnarray*}
and
\begin{eqnarray*}
w_{j} & = & \hat{S}(t_{j-1}).
\end{eqnarray*}
This comes from the model of Equation~\ref{eq:g} when $F$ is the logistic distribution. Since in the absence of censoring the resulting test reduces to the Wilcoxon-Mann-Whitney test, we call this model from the \pkg{interval} package by the option \code{scores = "wmw"}.
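As a check on the interpretation of these scores, the reduction to the Wilcoxon-Mann-Whitney test in the absence of censoring can be seen directly in the following sketch (not run); the uncensored responses are invented for illustration, and entering them with the left endpoint equal to the right endpoint tells \code{ictest} that the event times are observed exactly.
<<eval = FALSE>>=
## Illustrative sketch (not run): with no censoring (L = R), scores = "wmw"
## should give essentially the same result as the Wilcoxon-Mann-Whitney test.
y <- c(5, 8, 12, 3, 9, 15, 11, 7)      # hypothetical exactly observed times
grp <- rep(c("a", "b"), each = 4)
ictest(y, y, grp, scores = "wmw", exact = TRUE)
wilcox.test(y[grp == "a"], y[grp == "b"], exact = TRUE)
@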
Other scores are possible (but less often used) from the model of Equation~\ref{eq:g} by choosing different distributions for $F$. The user may specify that $F$ is normal with \code{scores = "normal"}, or may supply an arbitrary distribution by using \code{scores = "general"} and specifying the function, $f \left\{ F^{-1}(\cdot) \right\}$ (see Equation~\ref{eq:perm}), using the \code{dqfunc} option.

For illustrative purposes, we now give the form of the three most often used scores in the special case of right-censoring. For this, we introduce new notation. Suppose that there are $m^*$ observed failure times, $t_1^* < t_2^* < \cdots < t^*_{m^*}$. In other words, there are $m^*$ subjects for which $x_i = R_i$ is known, so that $L_i = \lim_{\epsilon \rightarrow 0} R_i - \epsilon \equiv R_i -0$. Let $n_j$ and $d_j$ be the numbers of subjects who are at risk or fail, respectively, at $t_j^*$. Then the Kaplan-Meier survival estimate is \citep[see e.g., ][]{Kalb:1980}
\begin{eqnarray*}
\hat{S}(t) & = & \prod_{j | t_j^* \leq t} \left( \frac{ n_j - d_j }{n_j } \right).
\end{eqnarray*}
In Table~\ref{tab:right} we summarize the formulation of the scores for the three model (score) choices in \pkg{interval}, as described above, for ordinary right-censored data.

\begin{table}
\noindent
\begin{tabular}{lcc}
Test & Score ($c_i$) for observed & Score ($c_{i'}$) for right-censored \\
(Model) & failure at $t_h^*$ & observation at $t_{h'}^*$ \\
\hline
Logrank1 & $ 1 - \sum_{\ell = 1}^{h} \frac{d_{\ell}}{n_{\ell}}$ & $ - \sum_{\ell = 1}^{h'} \frac{d_{\ell}}{n_{\ell}}$ \\
(Logistic, Sun) & & \\
Logrank2 & $\frac{ n_h}{d_h } \left\{ - \log \left( \frac{n_h - d_h}{n_h} \right) \right\} + \log \hat{S}(t_h^*)$ & $\log \left\{ \hat{S}(t_{h'}^*) \right\}$ \\
(Grouped Proportional & & \\
Hazards, Finkelstein) & & \\
Generalized WMW & $\hat{S}(t_{h-1}^*) + \hat{S}(t_h^*) -1$ & $\hat{S}(t_{h'}^*) - 1 $ \\
(Proportional Odds) & & \\
\hline
\end{tabular}
\begin{centering}
\caption{Scores for right-censored data. \label{tab:right} }
\end{centering}
\end{table}

\section{Application}
\label{sec-application}

The calls to the \pkg{interval} package are designed to be in the same format as in the \pkg{survival} package. As noted in the previous section, the \code{icfit} and \code{ictest} functions will also work on right-censored data (see \code{demo("right.censored.examples")}). We demonstrate the two main functions in \pkg{interval}, \code{icfit} and \code{ictest}, with the commonly used interval-censored breast cosmesis data set of \cite{Fink:1985}. The data are from a study of two groups of breast cancer patients, those treated with radiation therapy plus chemotherapy (\code{treatment = "RadChem"}) and those treated with radiation therapy alone (\code{treatment = "Rad"}). The response is time (in months) until the appearance of breast retraction, and the data are interval-censored between the last clinic visit before the event was observed (left) and the first visit when the event was observed (right) (or \code{Inf} if the event was not observed). The following provides the first few observations of the data set.
<<echo = FALSE, results = hide>>=
## set output line size
options(width = 65)
library(interval)
library(coin)
@
<<>>=
library("interval")
data("bcos", package = "interval")
head(bcos)
@

\subsection{Survival estimation}

First, we calculate the NPMLE for each treatment group in the breast cosmesis data separately.
<<>>=
fit1<-icfit(Surv(left, right, type = "interval2")~treatment, data = bcos)
summary(fit1)
@
The \code{Surv} function is from the \pkg{survival} package, and the \code{type = "interval2"} treats the variables \code{left} and \code{right} as the left and right endpoints of the interval. The assumption is that the left endpoint is excluded but the right one is included, except that if both are equal then both are included, and values of \code{left} may be 0 and values of \code{right} may be \code{Inf}. One can change the inclusion/exclusion defaults by using the \code{Lin} and \code{Rin} options. These results match those calculated from \pkg{Icens}, an available package for computing the NPMLE for censored and truncated survival data \citep[see][]{Gent:2001}.

The \code{summary} function applied to an ``\code{icfit}'' object gives the intervals with positive probability for the NPMLE of the survival distribution function, i.e., where the estimated survival distribution drops; however, there are infinitely many functions which drop exactly the same increment within those intervals. The NPMLE is unique only outside of the intervals which are listed in the summary of the fit. For example, there are infinitely many survival functions for the \code{treatment = "Rad"} group that have $S(4)=1$ and $S(5)=1-0.0463= 0.9537$. Thus, as has been done in the \pkg{Icens} package, when plotting the NPMLEs we denote the areas with the indeterminate drops with gray rectangles. The function which linearly interpolates the survival within these indeterminate regions is also displayed on the graph. We plot the NPMLE for each treatment group using \code{plot(fit1)} to get Figure~\ref{fig:npmle.strat}.

\begin{figure}[t]
\centering
<<fig = TRUE, echo = FALSE>>=
plot(fit1)
@
\caption{Nonparametric maximum likelihood estimate of survival from the breast cosmesis data.}
\label{fig:npmle.strat}
\end{figure}

\subsection{Two-sample weighted logrank tests}

There are five types of rank scores available in \code{ictest}, which are selected by setting the \code{scores} argument to be one of \code{"logrank1"}, \code{"logrank2"}, \code{"wmw"}, \code{"normal"}, or \code{"general"}. As stated in Section~\ref{sec-details}, the two forms of the logrank scores are those associated with \cite{Fink:1986} and those associated with \cite{Sun:1996}. Although the scores of \cite{Fink:1986} are perhaps more natural for interval-censored data, we make those of \cite{Sun:1996} the default (\code{scores = "logrank1"} or equivalently \code{rho = 0}) since these scores reduce to the usual logrank scores with right-censored data. The default method is the permutation test, and since the sample size is sufficiently large we automatically get the version based on the permutational central limit theorem:
<<>>=
icout<-ictest(Surv(left, right, type = "interval2")~treatment, data = bcos)
icout
@
To pick which of the rank score methods is best, one may plot for each treatment group the NPMLE of the distribution transformed by the inverse of the error distribution from the grouped continuous model \citep[see][]{Fay:1996}. In generalized linear models these transformations are known as links.
For example, the proportional hazards model is motivated by the extreme value error distribution, whose inverse is the complementary log-log link, which is the default. In Figure~\ref{fig:cll.plot} we plot this using the \code{fit1} ``\code{icfit}'' object, which contains the NPMLE for each treatment group, with the code \code{plot(fit1, dtype = "link")}.

\begin{figure}[t]
\centering
<<fig = TRUE, echo = FALSE>>=
plot(fit1, dtype = "link")
@
\caption{Complementary log-log transformation of the distribution from the breast cosmesis data. If the curves are parallel, then proportional hazards is a good model.}
\label{fig:cll.plot}
\end{figure}

Other links besides the default complementary log-log link are possible: \code{dlink = qlogis} or \code{dlink = qnorm} for the fit of the proportional odds and normal models, respectively. \cite{Fay:1996} presents those plots for these data, and none of the models looks clearly better than the others.

Because a major part of the calculation of the test statistic is estimating the NPMLE under the null hypothesis (i.e., for the pooled treatment groups), this NPMLE is saved as part of the output (\verb@icout$fit@, in the above example) so that we can calculate this NPMLE once and reuse it for the calculation of the other two score tests. Here is code for the Finkelstein logrank formulation, which takes advantage of a precalculated NPMLE:
<<>>=
ictest(Surv(left, right, type = "interval2")~treatment, data = bcos,
    initfit = icout$fit, scores = "logrank2")
@
Notice how the two different logrank tests give very similar results, with $p$ close to 0.007 in both cases. We demonstrate the third score test, the generalization of the Wilcoxon-Mann-Whitney scores to interval-censored data, and also demonstrate the \code{ictest} function in default mode:
<<>>=
L<-with(bcos, left)
R<-with(bcos, right)
trt<-with(bcos, treatment)
ictest(L, R, trt, scores = "wmw", initfit = icout$fit)
@

\subsection{$k$-sample and trend tests}

We can perform $k$-sample tests using the \code{ictest} function. We create fake treatment assignments with four treatment groups for the individuals in the breast cosmesis data set to demonstrate.
<<echo = FALSE>>=
set.seed(1232)
@
<<>>=
fakeTrtGrps<-sample(letters[1:4], nrow(bcos), replace = TRUE)
ictest(L, R, fakeTrtGrps)
@
When \code{scores = "wmw"} and the responses are all non-overlapping intervals, this reduces to the Kruskal-Wallis test.

The function \code{ictest} performs a trend test when the covariate is numeric. The one-sided test with \code{alternative = "less"} rejects when the correlation between the generalized rank scores (e.g., WMW scores or logrank scores) and the covariate is small. Below, a continuous covariate $z$ that is uncorrelated with the outcome is created for individuals in the cosmesis data set to illustrate the trend test.
<<echo = FALSE>>=
set.seed(931)
@
<<>>=
fakeZ<-rnorm(nrow(bcos))
ictest(L, R, fakeZ, alternative = "less")
@

\subsection{Exact permutation tests}

We can also estimate the exact permutation $p$-value for any score choice in \code{ictest} using the \code{exact} argument. Here the logrank test using \cite{Sun:1996} scores is redone as an exact test:
<<>>=
ictest(Surv(left, right, type = "interval2")~treatment, data = bcos,
    exact = TRUE, scores = "logrank1")
@
The \code{exact} argument automatically chooses between an exact calculation by the network algorithm and an approximation to the exact $p$-value by Monte Carlo, through the \code{methodRuleIC1} function. In this case the network algorithm was expected to take too long and the Monte Carlo approximation was used.
If a more accurate approximation to the exact $p$-value is needed, then more Monte Carlo replications could be used; the number of replications is changed using the \code{mcontrol} option. Additionally, if \code{icout} is an ``\code{ictest}'' object created by the \code{ictest} function, then \code{icout\$scores} will give the vector of rank scores, $c_i$, which may be input into other software (e.g., \proglang{StatXact}) to create an exact permutation test. A more seamless interaction is possible with the \pkg{coin} package (see Section~\ref{sec-perm.coin}).

\subsection{Other test options}

All of the above are permutation-based tests, but we may use other methods. Here are the results from the usual score test for interval-censored data:
<<>>=
ictest(Surv(left, right, type = "interval2")~treatment, data = bcos,
    initfit = icout$fit, method = "scoretest", scores = "logrank2")
@
where in this case the nuisance parameters are defined after calculation of the NPMLE as described in \cite{Fay:1996}. The results agree exactly with \cite{Fay:1996} and are similar to those in \cite{Fink:1986}. The very small differences may be due to differing convergence criteria in the calculation of the NPMLE.

The imputation method of \cite{Huan:2008} may also be employed (note that \code{scores = "logrank2"}, \code{"normal"}, or \code{"general"} are not available for this method):
<<>>=
icoutHLY<-ictest(Surv(left, right, type = "interval2")~treatment, data = bcos,
    initfit = icout$fit, method = "wsr.HLY", mcontrol = mControl(nwsr = 99),
    scores = "logrank1")
icoutHLY
@
These results agree with \cite{Huan:2008} within the error to be expected from such an imputation method \citep[][ had $p=0.0075$]{Huan:2008}.

For the breast cosmesis data, if we can assume that the assessment times are independent of treatment arm, then the assumptions needed for the permutation methods and the imputation methods are met. The assumptions for the theoretical use of the score test do not hold because the NPMLE is on the boundary of the parameter space, since some masses were set to zero in the calculation of the NPMLE (although the {\it ad hoc} adjustment appears to give reasonable results). When some of these masses are set to zero, the \code{anypzero} element of the ``\code{icfit}'' object will be \code{TRUE}, as we see here:
<<>>=
icoutHLY$fit$anypzero
@
If we cannot assume independence of assessment times and treatment arm, then the exchangeability assumption needed for the permutation method is not met, and the imputation method may be used. Further research is needed on the practical ramifications of the violation of any of these assumptions.

\section[The perm package]{The \pkg{perm} package}
\label{sec-perm}

The default method for inference in the \pkg{interval} package is the permutation test. The \pkg{perm} package presented here is a stand-alone \proglang{R} package that performs linear permutation tests. The tests that can be done in \pkg{perm} can also be done in the existing \pkg{coin} package; however, there are slight differences in the calculations and presentation, as outlined below.

\subsection{Overview of methods}
\label{sec-permoverview}

The \pkg{perm} package performs linear permutation tests, meaning permutation tests where the test statistic is either of the form
\begin{eqnarray}
T({\bf y}, {\bf x}) & = & \sum_{i=1}^{n} c_i z_i \label{eq:cizi}
\end{eqnarray}
as in Equation~\ref{eq:perm}, or a quadratic version of $T({\bf y}, {\bf x})$ (e.g., see the $k$-sample tests below).
We consider only the case where $c_i$ is a scalar and $z_i$ is either a scalar or a $k \times 1$ vector \citep[although more general cases are studied in ][]{Sen:1985,Hoth:2006}. Following \cite{Sen:1985}, we can write the mean and variance of $T$ under the permutation distribution (i.e., permute the indices of $c_1, \ldots, c_n$ and recalculate $T$, where there are $n!$ different permutations with each equally likely) as
\begin{eqnarray*}
U = \E_{P}(T) & = & n \bar{c} \bar{z} \\
 & \mbox{} & \\
V = \VAR_{P}(T) & = & \frac{1}{n-1} \left\{ \sum_{i=1}^{n} (c_i - \bar{c} )^2 \right\} \left\{ \sum_{j=1}^{n} (z_j - \bar{z}) (z_j - \bar{z})^{\top} \right\},
\end{eqnarray*}
where $\bar{c}$ and $\bar{z}$ are the sample means. In the \pkg{perm} package, if $z_i$ is a scalar we define the one-sided $p$-value when \code{alternative = "greater"} as
\begin{eqnarray*}
p_G = \frac{ \sum_{i=1}^{n!} I(T_i \geq T_0) }{n!},
\end{eqnarray*}
where $I(A)=1$ when $A$ is true and 0 otherwise, $T_i$ is the test statistic for the $i$th of $n!$ possible permutations, and $T_0$ is the observed value of $T$. When \code{alternative = "less"} the $p$-value, say $p_L$, is given as above except we reverse the direction of the comparison operator in the indicator function. Note that if we add or multiply by constants which do not change throughout all permutations, then the one-sided $p$-values do not change. Thus, a permutation test on $T$ can represent a test on the difference in means in the two-sample case, and can represent a test on the correlation when $z_i$ is numeric.

When \code{alternative = "two.sided"}, there are two options for how to calculate the $p$-value, defined by the \code{tsmethod} option of the \code{control} variable. When \code{tsmethod = "central"}, the two-sided $p$-value is $p_2$, twice the minimum one-sided $p$-value (i.e., $p_2=\min(1, 2\min(p_L, p_G))$), and when \code{tsmethod = "abs"} the two-sided $p$-value is
\begin{eqnarray*}
p_{2A} = \frac{ \sum_{i=1}^{n!} I( |T_i-U| \geq |T_0-U|) }{n!}.
\end{eqnarray*}
When $z_i$ is a $k \times 1$ vector, we consider only \code{alternative = "two.sided"} (in this case both \code{tsmethod} options give the same result), and the Wald-type test statistic $Q=(T-U)^{\top} V^{-} (T-U)$ is calculated, where $V^{-}$ is the generalized inverse of $V$.

To calculate the exact $p$-values, one may use complete enumeration, but this quickly becomes intractable and other algorithms are needed. One algorithm supplied is the network algorithm \citep[see e.g., ][]{Agre:1990}. Monte Carlo approximations to the exact $p$-values may also be performed. Finally, the \pkg{perm} package allows asymptotic methods such as the permutational central limit theorem (PCLT). \cite{Sen:1985} reviews the PCLT, which shows that under the permutation distribution, with standard regularity conditions on the $c_i$ and $z_i$, $V^{-1/2} (T-U)$ is asymptotically multivariate normal with mean 0 and variance equal to the identity matrix.

Note that because of the way exact $p$-values are defined, minuscule differences in the way the computer treats ties between $T_0$ and the $T_i$ can lead to non-negligible differences in the calculated exact $p$-values for small samples. This is a problem for all permutation test software, but because of the generality of the \pkg{perm} package (i.e., the $T_i$ can be any values) it can be a particularly difficult one.
The solution taken by \pkg{perm} and by \pkg{coin} (Versions 1.0-8 or later) is to round so that differences between $T_i$ and $T_0$ that are close to zero become zero. In \pkg{perm} this is done with the \code{digits} option, which by default rounds the $T_i$ to 12 digits. This can be a particular problem for WLRTs, as is shown by the example in Section~\ref{sec-ties}.

\subsection{Design and implementation}
\label{sec-permDI}

The three main functions in \pkg{perm} perform the two-sample (\code{permTS}), $k$-sample (\code{permKS}), and trend (\code{permTREND}) tests. To demonstrate the package we will use the \code{ChickWeight} data set in the \pkg{datasets} package, which is part of the base distribution of \proglang{R}. The data are from an experiment on chicks fed one of four diets. We use only the weight in grams of the chicks at day 21 (i.e., \code{Time == 21}) as the response. For the two-sample examples that follow, we use only the chicks given diets 3 and 4.
<<>>=
data("ChickWeight", package = "datasets")
head(ChickWeight)
@
The package uses \proglang{S}3 methods, which allow object-oriented programming. The evaluation of any of the three main functions is determined by the class of the first object in the function call. This is similar to the calls to the analogous functions in the \pkg{stats} package. For example, as is possible using \code{t.test} or \code{wilcox.test} from the \pkg{stats} package, we can call the test using the formula structure, which automatically uses the \code{permTS.formula} function. The formula is \code{weight~Diet}, where the $i$th element of \code{weight} represents $c_i$ and the $i$th element of \code{Diet} represents $z_i$ in Equation~\ref{eq:cizi}. We also use the \code{data} and \code{subset} arguments to name the data set and pick out only the day 21 weights of those chicks that got diets 3 or 4.
<<>>=
permTS(weight~Diet, data = ChickWeight, subset = Diet %in% c(3, 4) & Time == 21)
@
Equivalently, we can define the responses explicitly and do the same test, with the default structure:
<<>>=
y3<-with(subset(ChickWeight, Time == 21 & Diet == 3), weight)
y4<-with(subset(ChickWeight, Time == 21 & Diet == 4), weight)
permTS(y3, y4)
@
The \code{permTS} function uses a function (determined by the \code{methodRule} option) to automatically choose the method used in the permutation test. Since in this case the exact network method is not expected to give a quick answer, the default \code{methodRule} automatically chooses to use the PCLT. If the sample sizes are small enough, then exact methods are done automatically. For example, using only the first 5 chicks on each diet, the \code{methodRule} function chooses the network algorithm:
<<>>=
permTS(y3[1:5], y4[1:5])
@
Note that the network algorithm is written entirely in \proglang{R} code, so efficiency gains may be possible by translating portions of the code into \proglang{C} code \citep[see e.g., ][]{Cham:2008}. In addition to the \code{"pclt"} and \code{"exact.network"} methods, the two-sample test also has both a complete enumeration exact algorithm (\code{method = "exact.ce"}), which is useful for simulations involving a small number of observations in each group, and a Monte Carlo approximation to the exact $p$-value (\code{method = "exact.mc"}). The user may supply their own \code{methodRule} function, which must have three input values: the numeric vectors $c=[c_1, \ldots, c_n]$ and $z=[z_1, \ldots, z_n]$ (see Equation~\ref{eq:cizi}), and the logical variable \code{exact} given by the option of the same name.
For the two-sample test the output of a \code{methodRule} function should be a character vector which is one of \code{"pclt"}, \code{"exact.network"}, \code{"exact.ce"}, or \code{"exact.mc"}. When \code{exact} is \code{TRUE}, the default \code{methodRule} for \code{permTS} chooses from among the three exact algorithms based on the estimated speed of the calculations (see the help for \code{methodRuleTS1}). All methods except \code{"exact.mc"} produce an ``\code{htest}'' object, which is a list with elements described in the help for \code{permTS} and printed according to the print method from the \pkg{stats} package. The \code{"exact.mc"} method creates an ``\code{mchtest}'' object, which additionally prints out confidence intervals on the $p$-value based on the Monte Carlo replications, to help the user determine whether the $p$-value would change much if the Monte Carlo procedure were repeated with a different random seed. Note that even if the confidence interval on the $p$-value is wide, the given $p$-value from the \code{"exact.mc"} method (i.e., one plus the number of Monte Carlo replications that are equal to or more extreme than the observed test statistic, divided by one plus the number of replications) is always a valid $p$-value \citep[see e.g., ][equation 5.3]{Fay:2007}.

For the $k$-sample test the calls may also be made by formula structure,
<<>>=
permKS(weight~Diet, data = ChickWeight, subset = Time == 21)
@
or by explicit definition of the response ($c$) and group ($z$) vectors,
<<>>=
y<-ChickWeight[ChickWeight$Time == 21, "weight"]
g<-ChickWeight[ChickWeight$Time == 21, "Diet"]
permKS(y, g)
@
The \code{methodRule} function works the same way as for \code{permTS}, except that a different default \code{methodRule} function is used, since the \code{exact.network} method is not available for the $k$-sample test.

If we assume for illustration that the diets are inherently ordered, then we may want to use the trend test. For the trend test (i.e., when $z_i$ is a scalar) the calls may also be made by formula structure (not shown), or by the default,
<<>>=
permTREND(y, as.numeric(g))
@
Similar \code{methodRule} functions may be used (see the help for \code{permTREND}).

Options for the algorithm methods are controlled by the variable \code{control}, and the function \code{permControl} allows changing only a subset of the options. The output from \code{permControl} is a list of all the options. Most of the options pertain to the \code{"exact.mc"} method: \code{nmc} gives the number of Monte Carlo replications, \code{p.conf.level} gives the confidence level calculated on the $p$-value, \code{seed} gives the random number seed, and \code{setSEED} is a logical telling whether or not to set the seed. Other options are \code{digits}, which adjusts the rounding of the test statistics used to decide which values to treat as ties (see Section~\ref{sec-ties}), and \code{tsmethod}, which gives two options for the two-sided method of calculating the $p$-values (see Section~\ref{sec-permoverview} above). The last option is \code{cm}, and it is used in the following scenario. Suppose one wanted to do a simulation on data of the same size as the example. Then one could calculate the complete enumeration matrix once (using the \code{chooseMatrix} function), and then each simulation could reuse that matrix.
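A minimal sketch of such a simulation (with hypothetical normally distributed data and two groups of size 5; only the structure matters) is:
<<eval = FALSE>>=
## calculate the complete enumeration matrix once for two groups of size 5
cm10c5 <- chooseMatrix(10, 5)
PC <- permControl(cm = cm10c5)
## reuse the matrix in every simulation replication
pvals <- replicate(100, permTS(rnorm(5), rnorm(5, mean = 1),
    method = "exact.ce", control = PC)$p.value)
@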
This will save time, as illustrated below on the \code{ChickWeight} data:
<<>>=
## set output line size
options(width = 65)
@
<<>>=
system.time(cm19c10<-chooseMatrix(length(y3)+length(y4), length(y3)))
system.time(PC<-permControl(cm = cm19c10))
system.time(permTS(y3, y4, method = "exact.ce", control = PC))
system.time(permTS(y3, y4, method = "exact.network"))
@
In a simulation, the first calculation only needs to be done once.

\subsection{Comparison with coin package}
\label{sec-perm.coin}

This section compares the two permutation packages, \pkg{coin} (version 1.0-8) and \pkg{perm} (version 1.0-0.0). The primary motivation for creating the \pkg{perm} package was to provide an independent validation of the \pkg{coin} package within \proglang{R}. All checks between \pkg{coin} and \pkg{perm} have shown no differences (see \code{perm.coin.check.R} in the test directory). In many ways, \pkg{coin} is the more comprehensive and general package of the two. For example, \pkg{coin} allows the following features that are not supported in \pkg{perm}: user-supplied transformations of the responses and covariates, other (nonlinear) test statistics, stratification, and multiple responses and covariates. Further, the exact algorithms in \pkg{coin} are faster than those in \pkg{perm}.

Here are some minor ways that \pkg{perm} is different from \pkg{coin}. First, the print method for \pkg{coin} prints a standardized $Z$ statistic to show the direction of the effect, while the print method for \pkg{perm} prints the difference in means and additionally prints the $Z$ statistic only when the PCLT is used. Second, the \pkg{perm} package allows \code{methodRule} functions, as previously described, to automatically choose the calculation method. No similar feature is offered in \pkg{coin}. Third, when using the Monte Carlo approximation to the exact $p$-value (\code{method = "exact.mc"} in \pkg{perm} and \code{distribution = approximate()} in \pkg{coin}), \pkg{perm} prints confidence intervals automatically with the print method, while \pkg{coin} prints them only when the \code{pvalue} function is used. Fourth, the \pkg{perm} package allows the \code{"exact.ce"} method, which does complete enumeration. If a simulation is desired for the two-sample test on a small sample size, then using the \code{"exact.ce"} method with the \code{cm} variable fixed by the control option (so that it does not need to be recalculated for each simulation) may give faster results than other algorithms. Fifth, the \pkg{perm} package allows two types of two-sided $p$-values (see \code{permControl}: option \code{tsmethod = "central"} (default) and \code{tsmethod = "abs"}), while \pkg{coin} allows only one type (denoted \code{tsmethod = "abs"} in \pkg{perm}). We emphasize that these differences are minor, and when the two packages perform the same analysis the results agree. Further extensibility of the \pkg{perm} package may not be needed, since many of the ways it may be expanded are covered by the \pkg{coin} package.

\section{The interval package}
\label{sec-interval}

We have already given some examples of how to use the \pkg{interval} package. In this section, we give more details on the structure of the package.

\subsection{Design and implementation}
\label{sec-intervalDI}

The \pkg{interval} package uses \proglang{S}3 methods. The two main functions are the \code{icfit} function and the \code{ictest} function. Both functions allow a formula as well as a default implementation, and both implementation styles were presented in Section~\ref{sec-application}.
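For reference, a minimal sketch of the two calling styles of \code{icfit} (using the breast cosmesis data set \code{bcos} supplied with the \pkg{interval} package) is:
<<eval = FALSE>>=
## formula style: the response is a "Surv" object, one NPMLE per treatment
icfit(Surv(left, right, type = "interval2")~treatment, data = bcos)
## default style: a single NPMLE for all of the data combined
icfit(bcos$left, bcos$right)
@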
Although the typical response for the formula will be of the ``\code{Surv}'' class, from the \pkg{survival} package, we also allow numeric responses, and these are treated as exactly observed values. The \code{icfit} function outputs an object of class ``\code{icfit}'', which is a list with the elements described in the help, with most elements exactly as in the ``\code{icsurv}'' class of the \pkg{Icens} package. The ``\code{icfit}'' class is different from the ``\code{icsurv}'' class primarily because it allows the NPMLE of the distributions for multiple strata to be stored as one ``\code{icfit}'' object, as is possible in the ``\code{survfit}'' class of the \pkg{survival} package. For example, if the right hand side of the formula contains a factor with $k$ levels, then the resulting ``\code{icfit}'' object will contain $k$ separate NPMLEs, one for each level. In that case the \code{strata} element will be a numeric vector of size $k$ giving the number of elements in each stratum, and the other objects (e.g., the vector of probability masses, \code{pf}) will be larger to include all the separate NPMLEs. The NPMLEs are separated by stratum when either the \code{summary} or \code{plot} methods are applied to the ``\code{icfit}'' objects.

The available methods for ``\code{icfit}'' objects are \code{print}, \code{summary}, \code{"["} and \code{plot}. The \code{print} method prints the object as a list, except that the `A' matrix (described below) is not printed; only its dimensions are given. The \code{summary} and \code{plot} methods have been shown in Section~\ref{sec-application}, and they either print or plot on one graph the NPMLE for each stratum. The \code{"["} method allows picking out the $i$th stratum from an ``\code{icfit}'' object.

Here are some details on the calculations in \code{icfit}. The \code{icfit} function calls a separate function, \code{Aintmap}, that calculates an `A' matrix and the `intmap'. The A matrix is an $n \times m$ matrix of zeros and ones with the $ij$th element being an indicator of whether the interval associated with the $i$th observation contains the $j$th interval of the intmap. The intmap is a matrix which gives the left and right endpoints of the intervals associated with the columns of A, and the attributes of the intmap tell whether the endpoints are included or not, as determined by the \code{Lin} and \code{Rin} options. The default is to exclude the left endpoint and include the right endpoint (i.e., $(L, R]$), except when $L=R$ (then the interval is treated as an exact observation, i.e., $[R, R]$) or when $R=\infty$, in which case the right endpoint is not included. Differences in the inclusion of the endpoints can change the results \citep[see][]{Ng:2002}. When the intervals of the intmap are regions of maximal cliques, then the A matrix is the transpose of the incidence or clique matrix defined in \cite{Gent:2002}. The \code{Aintmap} function is called internally by the \code{icfit} function, and the innermost intervals (i.e., regions of maximal cliques) are calculated to possibly reduce the number of columns of the A matrix.

Once an A matrix is calculated and reduced to represent only innermost intervals, an initial estimate of the survival distribution is needed for the E-M algorithm. The \code{initfit} option controls that initial estimate. An allowable option for \code{initfit} is to provide an initial NPMLE estimate as an object of either class ``\code{icfit}'' or ``\code{icsurv}''. Another option for \code{initfit} is a character vector giving the name of a function to calculate an initial fit.
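As a purely hypothetical sketch (the function \code{initUniform} and its use of only the \code{A} matrix are our own illustration, not part of the package), such a named function might simply spread the probability mass uniformly over the columns of the A matrix:
<<eval = FALSE>>=
## hypothetical initial-fit function: uniform mass on the columns of A
initUniform <- function(A) {
    rep(1/ncol(A), ncol(A))
}
icfit(bcos$left, bcos$right, initfit = "initUniform")
@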
Any function named in this way must have as inputs any or all of five variables (\code{L}, \code{R}, \code{Lin}, \code{Rin}, and/or \code{A}), and must output a vector of probability masses estimating the distribution; optionally, it may also output the corresponding intmap. The default for \code{initfit} is \code{initcomputeMLE}, a function that calls \code{computeMLE} from the \pkg{MLEcens} package. Note that if the \code{initfit} function (such as \code{initcomputeMLE}) gives an error, then \code{icfit} ignores this calculation, gives a warning, and calculates a very simple initial distribution instead. Other values may be passed to the \code{initfit} function through the \code{initfitOpts} element of the \code{control} option of \code{icfit}; \code{initfitOpts} must be a named list of options. Once the initial distribution is calculated, it is used as the starting value in a modified E-M algorithm that allows `polishing' elements to zero and subsequently checks the Kuhn-Tucker conditions \citep[see][]{Gent:1994}. If the initial distribution is very close to the NPMLE, then convergence may happen on the first iteration. On the other hand, convergence can still occur even when the initial distribution is not very close to the NPMLE. If the initial distribution has some intervals set to zero that should not be, then the Kuhn-Tucker conditions will not be met and the \code{message} value of the resulting ``\code{icfit}'' object (e.g., \verb@fit$message@) will state this fact. We test in \code{demo("npmle")} that the NPMLE estimates from the \pkg{Icens} package match those from the \code{icfit} function. In that demo we compare the NPMLEs from the two packages for the breast cosmesis data. Additionally, in the demo we simulate 30 other data sets and show that the NPMLEs match for all the simulated data sets (data not shown).

The \code{ictest} function outputs an object of class ``\code{ictest}'' for which there is a \code{print} method, modeled after the \code{print} method for the ``\code{htest}'' class used in the \pkg{stats} package that comes with base \proglang{R}. Objects of class ``\code{ictest}'' are lists with many elements (see the \code{ictest} help). There are four choices for a predefined type of rank score: \code{"logrank1"} \citep[scores described in][]{Sun:1996}, \code{"logrank2"} \citep[scores described in][]{Fink:1986}, \code{"wmw"}, and \code{"normal"} \citep[the WMW scores or normal scores described in][]{Fay:1996}. Additionally, the option \code{"general"} allows general scores for arbitrary error distributions under the grouped continuous model \citep[see][]{Fay:1996}. To show the general scores we consider the logistic error distribution. We can show that these scores are equivalent to the Wilcoxon-type scores (within computational error):
<<>>=
icout<-ictest(Surv(left, right, type = "interval2")~treatment, data = bcos,
    scores = "wmw")
wmw.scores<-icout$scores
logistic.scores<-ictest(Surv(left, right, type = "interval2")~treatment,
    data = bcos, icFIT = icout$fit, scores = "general",
    dqfunc = function(x){ dlogis(qlogis(x))})$scores
max(abs(wmw.scores-logistic.scores))
@
There are many inferential methods available for \code{ictest}, as described in Section~\ref{sec-inferential.method}, and the method may be either explicitly stated as a character vector or may be the result of a \code{methodRule} function.
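For example, the asymptotic permutation approach may be requested explicitly (a minimal sketch; the available method names are listed in the \code{ictest} help):
<<eval = FALSE>>=
## explicitly request the permutational central limit theorem method
ictest(Surv(left, right, type = "interval2")~treatment, data = bcos,
    scores = "logrank1", method = "pclt")
@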
The \code{methodRule} function works similarly to the one in the \pkg{perm} package, except that the input must be three objects: the vector of rank scores, the vector of group membership values, and \code{exact}, a logical value taken from the argument of the same name. Other \code{methodRule} functions may be created to automatically choose the method based on a function of these three objects, but the default is \code{methodRuleIC1} (see the help for that function for details). Note that permutation inferences are available for all types of rank scores, but other types of inferences are not available for all the scores; see Section~\ref{sec-inferential.method} and the \code{ictest} help for details.

Here is an overview of the calculation functions used in \code{ictest}. First, unless \code{icFIT} is given, the NPMLE of the distribution for all individuals is calculated using the \code{icfit} function. Any options used with the \code{icfit} function may be passed from the \code{ictest} call by using the \code{icontrol} option. Using the resulting NPMLE from the \code{icfit} call, the rank scores are calculated using the \code{wlr_trafo} function. Similar to the \code{ictest} function, \code{wlr_trafo} is an \proglang{S}3 function with a default method and one for ``\code{Surv}'' objects, but additionally there is a method for ``\code{data.frame}'' objects. In the ``\code{data.frame}'' method, the data frame must contain only one variable, which must have either the ``\code{Surv}'' or the ``\code{numeric}'' class. The purpose of the ``\code{data.frame}'' method is to properly interact with the \pkg{coin} package (see Section~\ref{sec-interval.coin} below). Once the rank scores are calculated, other functions are called depending on the inferential method chosen: the \code{icScoreTest} function for the score test, the \code{icWSR} function for the imputation approach, and the functions from \pkg{perm} for the permutation approaches.

\subsection{Interacting with the coin package}
\label{sec-interval.coin}

The \pkg{coin} package allows different transformations for the response variables, and we can use the \code{wlr_trafo} function as a transformation function within \pkg{coin}.
<<>>=
library("coin")
independence_test(Surv(left, right, type = "interval2")~treatment,
    data = bcos, ytrafo = wlr_trafo)
@
This reproduces the asymptotic results from \code{method = "pclt"} of \code{ictest}. A key strength of \pkg{coin} is its fast algorithms for exact permutation calculations. Even these fast methods are still intractable for the full breast cosmesis data set, but we show here how they may be applied quickly to a subset of that data.
<<>>=
SUBSET<-c(1:5, 50:65)
independence_test(Surv(left, right, type = "interval2")~treatment,
    data = bcos, subset = SUBSET, ytrafo = wlr_trafo, distribution = exact())
ictest(Surv(left, right, type = "interval2")~treatment,
    data = bcos, subset = SUBSET, method = "exact.network")
@
The $p$-values are different because, as discussed in Section~\ref{sec-perm.coin}, the default two-sided method for the \pkg{coin} package (corresponding to the \code{control} option \code{tsmethod = "abs"} in \pkg{perm}) differs from the default for the \pkg{perm} package (\code{tsmethod = "central"}), and hence also from the default for the \pkg{interval} package.
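Within \pkg{perm} itself the analogous choice is made through \code{permControl} (a minimal sketch, reusing the small \code{ChickWeight} subsets from Section~\ref{sec-permDI}):
<<eval = FALSE>>=
## two-sided p-value using the absolute value method, matching the coin default
permTS(y3[1:5], y4[1:5], method = "exact.network",
    control = permControl(tsmethod = "abs"))
@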
When we use \code{tsmethod = "abs"} we reproduce the results from \pkg{coin}:
<<>>=
ictest(Surv(left, right, type = "interval2")~treatment, data = bcos,
    subset = SUBSET, method = "exact.network",
    mcontrol = mControl(tsmethod = "abs"))
@
Note that the algorithms in \pkg{coin} can be considerably faster. To show this, consider a larger subset of the breast cosmesis data:
<<>>=
SUBSET2<-c(1:12, 47:58)
system.time(
independence_test(Surv(left, right, type = "interval2")~treatment,
    data = bcos, subset = SUBSET2, ytrafo = wlr_trafo, distribution = exact())
)
system.time(
ictest(Surv(left, right, type = "interval2")~treatment,
    data = bcos, subset = SUBSET2, method = "exact.network",
    mcontrol = mControl(tsmethod = "abs"))
)
@

\subsection{On handling ties for exact permutation implementation}
\label{sec-ties}

In implementing the exact version of permutation tests, the way ties are handled may change the resulting $p$-values by non-negligible amounts. In this section we present a simple artificial example to show how the handling of ties can make it difficult to reliably reproduce results, and we show that the \code{ictest} function gives the correct result. Consider the following data set:
<<>>=
L<-c(2, 5, 1, 1, 9, 8, 10)
R<-c(3, 6, 7, 7, 12, 10, 13)
group<-c(0, 0, 1, 1, 0, 1, 0)
example1<-data.frame(L, R, group)
example1
@
In this case we can calculate the NPMLE exactly; it is given in Table~\ref{tab:NPMLE1}.
\begin{table}
\begin{centering}
\begin{tabular}{ccc}
$(L$ & $R]$ & probability \\ \hline
2 & 3 & $2/7$ \\
5 & 6 & $2/7$ \\
9 & 10 & $3/14$ \\
10 & 12 & $3/14$ \\
\end{tabular}
\end{centering}
\caption{NPMLE for \code{example1}. \label{tab:NPMLE1}}
\end{table}
We calculate this NPMLE with the \pkg{interval} package as
<<>>=
summary(icfit(L, R), digits = 12)
@
which matches the exact values to at least 12 digits:
<<>>=
print(3/14, digits = 12)
@
Usually the fit will not be this close, and the closeness of the fit is determined by the \code{icfitControl} list (see the help). The problem relates to the numerical precision of the calculated rank scores and the subsequent permutation $p$-value when there is a small number of permutations and there are ties in the test statistics across different permutations (for interval censoring, possibly stemming from overlapping intervals). While not unique to interval-censored data, this combination of factors may be more common in this setting. We can calculate the exact scores for the Sun method (Equation~\ref{SunScore} and \code{scores = "logrank1"}); these are
\begin{eqnarray*}
& & \left[ \frac{5}{7}, \frac{11}{35}, \frac{18}{35}, \frac{18}{35}, - \frac{24}{35}, -\frac{13}{70}, -\frac{83}{70} \right]
\end{eqnarray*}
These scores sum to zero (as do all such scores regardless of the model). There are $\left( \begin{array}{c} 7 \\ 3 \end{array} \right)=35$ unique permutations with equal probability. Note that the difference in means of the original scores (with group=$[0, 0, 1, 1, 0, 1, 0]$) gives a value equivalent to that of the permutation with group=$[1, 1, 0, 0, 0, 1, 0]$, because the sum of the first and second scores equals the sum of the third and fourth scores. Thus, we have a tie in the permutation distribution. We need to make sure the computer treats it as a tie; otherwise the $p$-value will be wrong. Dealing with ties in computer computations can be tricky \citep[see \proglang{R} FAQ 7.31 in][]{R-FAQ}. To see the details, we completely enumerate all the sums of the scores in one group.
We use the function \code{chooseMatrix} from \pkg{perm} to generate the full list of permutations of the original \code{group} variable. We print out only the first $9$ of the $35$ ordered test statistics, placing the difference in means in the 8th column, next to the permutation of the group allocation:
<<>>=
score1<-wlr_trafo(Surv(L, R, type = "interval2"))
cm<-chooseMatrix(7, 3)
T<- ( (1-cm) %*% score1 )/4 - ( cm %*% score1 )/3
cbind(cm, T)[order(T), ][1:9, ]
@
The seventh and eighth smallest of the 35 test statistics are tied, and this tied value is the test statistic for the original group assignment, so that the one-sided $p$-value is $8/35 \approx 0.2286$. The function \code{ictest} properly calculates this $p$-value. The way that \pkg{perm} directly addresses the ties issue is to allow the user to specify the numerical precision, i.e., the test statistics are rounded to \code{permControl()\$digits} significant digits, and \pkg{perm} treats values of the permutation distribution that are tied to that many significant digits as true ties. We have not shown that this method of handling ties always works properly; however, it does work in all the cases we have tried. We would like to emphasize that this issue is only a problem for small sample sizes when using exact permutation methods. Additionally, it is a potential problem for any permutation test in which the test statistics have a non-zero probability of producing a tie. Very small differences in the rank scores will only produce correspondingly small differences in the asymptotic approximation, so as the sample sizes get large and the approximation becomes more accurate (as guaranteed by the permutational central limit theorem), the way ties are handled affects large-sample results minimally.

\section*{Acknowledgements}

We would like to thank the editors and anonymous reviewers of the {\it Journal of Statistical Software} for their valuable comments, which have improved this paper and the packages.

\bibliography{interval}

\end{document}