---
title: "Introduction to propertee"
author: "propertee Authors"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to propertee}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
#devtools::load_all("~/repositories/_r/propertee/")
library(propertee)
```

## Main Features

The **propertee** package, Prognostic Regression Offsets with
Propagation of ERrors (for Treatment Effect Estimation), facilitates
direct adjustment or standardization for experiments and observational
studies, with the option of using a separately fitted prognostic
regression model for covariance adjustment. Propertee calls for a
self-standing specification of a study's design and treatment
allocations, using these to create weights needed to align treatment
and control samples for estimation of treatment effects averaged
across a common standard population, with standard errors that
appropriately reflect the study design. For covariance adjustment it
enables offsetting the outcome against predictions from a dedicated
covariance model, with standard error calculations propagating error
as appropriate from the covariance model.

The main workflow consists of two main steps and one optional step:

1. Encode the study's design, including the unit of assignment,
   treatment status of each unit of assignment, and block membership
   information as appropriate, in a `StudySpecification` object. This
   is accomplished with the `obs_spec()`, `rct_spec()` or `rd_spec()`
   functions.
2. Optionally, fit a covariate adjustment model.  This is done with
   any of a number of modeling functions external to `propertee`, for
   example `lm()` or `glm()` from the `stats` package, or `lmrob` from
   the `robustbase` package.
3. Use `lm()` to contrast the separately reweighted treatment and
   control groups, incorporating optional covariance adjustments as an
   `offset` to `lm()`. The `StudySpecification` is unobtrusively
   retained, as is error propagation from the optional covariance
   adjustment, for use in subsequent standard error calculations.
   This is done using the `lmitt()` function, which either takes and
   enriches a fitted `lm` or, in the more fully supported usage,
   internally calls `lm()` to itself create a directly adjusted
   contrast of treatment conditions.

## Example Data

The example dataset comes from the state of Tennessee's Student-Teacher
Achievement Ratio (STAR) experiment. Students were randomly assigned to three
possible classroom conditions: small (13 to 17 students per teacher), regular
class (22 to 25 students per teacher), and regular-with-aide class (22 to 25
students with a full-time teacher's aide).

```{r}
data("STARplus")
table(STARplus$cond_at_entry)
```

For simplicity for this first example, we will examine a single binary
treatment - "small" classrooms versus "regular" and "regular+aide" classrooms.

```{r}
STARplus$cond_small <- STARplus$cond_at_entry == "small"
table(STARplus$cond_small)
```

After this basic example, we will see how **propertee** makes it easy to handle
non-binary treatment variables by introducing `dichotomy`s.

The outcome of interest is a reading score at the end of kindergarten.

```{r}
summary(STARplus$g1treadss)
```


We first take the random assignment scheme to have been either complete or
Bernoulli randomization of students, separately by school. (Later we'll change this to the more realistic assumption of complete or Bernoulli randomization, separately by school *and* year of entry into the study.) Accordingly, experimental blocks are given by the `school_at_entry` variable.

```{r}
length(unique(STARplus$school_at_entry))
head(table(STARplus$school_at_entry))
```

We need a unique identifier for the unit by which treatment was allocated.  STAR treatments were assigned to students, as opposed to classrooms or other aggregates of students; accordingly we need a unique identifier per student. The  `stdntid` variable fills this role.

```{r}
length(unique(STARplus$stdntid))
head(STARplus$stdntid)
```

## A Basic Example

### Defining the `StudySpecification`

The three `_spec` functions (`rct_spec()`, `obj_spec()`, and `rd_spec()`)
operate similarly. The first argument is the most important, and encodes all the
specification information through the use of a formula. The left-hand side of
the formula identifies the treatment variable. The right-hand side of the
formula consists of the following potential pieces of information:

1. `unit_of_assignment()`: This identifies the variable(s) which indicate the
   units of assignment. This is required for all specifications. The alias
   `uoa()` can be used in its place.
2. `block()`: The identifies the variable(s) which contain block information.
   Optional.
3. `forcing()`: In regression discontinuity specifications (`rd_spec()`), this
   identifies the variable(s) which contain forcing information.

To define a `StudySpecification` in our example:

```{r}
spec <- obs_spec(cond_small ~ unit_of_assignment(stdntid) + block(school_at_entry),
                  data = STARplus, na.fail = FALSE)
summary(spec)
```

Should additional variables be needed to identify the unit of assignment,
block, or forcing, they can be included. Had the variable identifying the years in which participants entered the study been available, for example, we might have identified the experimental blocks as combinations of school and year of study entry, rather than as the school alone.  To do this we would pass `block(year_at_entry, school_at_entry)`, rather than `block(school_at_entry)`, in the above.

### Estimating the treatment effect

The main function for estimating treatment effects is the `lmitt()` function. It
takes in three main required arguments:

1. A formula specifying the outcome and the desired treatment effect.
2. The data set containing the outcome information.
3. A [`StudySpecification`](#defining-the-specification).

Note that the data set does **not** need to be the same data set which generated
the `StudySpecification`; it does however need to include the same variables to
identify the units of assignment. (If the variable names differ, the `by=`
argument can be used to link them, though we recommend renaming to reduce the
likelihood of issues.)

For example, you may have one dataset containing school-level information, and a
separate dataset containing student-level information. Assume school is the unit
of assignment. While you could of course merge those two data-sets,
**propertee** can instead use the school-level data to define the
`StudySpecification`, and the student-level data to estimate the treatment
effect.

The formula entering `lmitt()` can take on one of two forms:

```r
y ~ 1
```

will estimate the main treatment effect on outcome variable `y`, and

```r
y ~ x
```

will estimate subgroup specific treatment effects for each level of `x` for the
outcome `y`, if `x` is categorical. For continuous `x`, a main effect and a
treatment-`x` interaction effect is estimated.

Therefore, to estimate the treatment effect in our example, we can run:

```{r}
te <- lmitt(g1treadss ~ 1, data = STARplus, specification = spec)
summary(te)
```

The data includes ethnicity; we can estimate subgroup effects by ethnicity:

```{r}
te_s <- lmitt(g1treadss ~ race, data = STARplus, specification = spec)
summary(te_s)
```

### Including specification weights

Study specification weights can be easily included in this estimation.
**propertee** supports average treatment effect (ATE) and effect of the
treatment on the treated (ETT) weights.

To include one of the weights, simply include the `weights = "ate"` or
`weights = "ett"` argument to `lmitt()`:

```{r}
lmitt(g1treadss ~ 1, data = STARplus, specification = spec, weights = "ate")
lmitt(g1treadss ~ 1, data = STARplus, specification = spec, weights = "ett")
```

Internally, these call the `ate()` or `ett()` functions which can be used
directly.

```{r}
head(ate(spec, data = STARplus))
```

When included inside `lmitt()`, you do not need to specify any additional
arguments to `ate()` or `ett()`, enabling easy functions of weights. For example
if you had some other weight variable, say `wgt`, you could include `weights =
wgt*ate()` in the `lmitt()` call.

### Covariance Adjustment models

By itself, `lmitt()` does not allow for other covariates; e.g. something like
`lmitt(y ~ 1 + control_var,...` will fail. To adjust for covariates, a separate
covariate model should be fit. Any model which supports a `predict()` function
should work.

```{r}
camod <- lm(g1treadss ~ gender * dob + race, data = STARplus)
```

The `cov_adj()` function can be used to process the covariance adjustment model
and produce the required values; and its output can be passed as an `offset=`.

```{r}
lmitt(g1treadss ~ 1, data = STARplus, specification = spec,
      weights = "ate", offset = cov_adj(camod))
```

Similarly to the weight functions, `cov_adj()` attempts to locate the correct
arguments (in this case, mainly the `data=` argument) to use in the model
command; while `cov_adj()` does fall back to using the data which is in the
covariance model, its safer to use the `newdata=` argument if calling
`cov_adj()` outside of the model.

```{r}
head(cov_adj(camod, newdata = STARplus))
```

Also, similarly to weights, `cov_adj()` can be used in normal modeling commands
as well.

```{r}
lm(g1treadss ~ cond_small, data = STARplus, weights = ate(spec),
   offset = cov_adj(camod))
```

### Absorbing Blocks

If fixed effects for blocks are desired, which can be absorbed away to avoid
estimating, the `absorb=TRUE` argument can be passed.

<!-- currently not working due to #210, thus turned off-->
```{r, eval=FALSE}
lmitt(g1treadss ~ 1, data = STARplus, specification = spec, absorb = TRUE)
```