[R] Reference for aov()

Thu Jul 14 10:50:42 CEST 2016

I am not aware of a detailed documentation of this beyond the actual source code. 
However, the principles are fairly straightforward, except that the rules for constructing the design matrix from the model formula can be a bit arcane at times.

The two main tools are the design matrix constructor (model.matrix) and a Gram-Schmidt type ortogonalization of its columns (the latter is called a QR decomposition in R, which it is, but there are several algorithms for QR, and the linear models codes depend on the QR algorithm being based on orthogonalization - so LINPACK works and LAPACK doesn't).

An ortogonalized model matrix generates a decomposition of the model space into orthogonal subspaces corresponding to the terms of the model. Projections onto each of the subspaces are easily worked out.  E.g., for a two-way analysis (Y~row+col+row:col) you can decompose the model effect as a row effect, a column effect, and an interaction effect. This allows straightforward calculation of the sums of squares of the ANOVA table. As you are probably well aware, if the design is unbalanced, the results will depend on the order of terms -- col + row + col:row gives a different result.

What aov() does is that it first decomposes the observations according to the Error() term, forming the error strata, then fits the systematic part of the model to each stratum in turn. In the nice cases, each term of the model will be estimable in exactly one stratum, and part of the aov() logic is to detect and remove unestimable terms. E.g., if you have a balanced two way layout, say individual x treatment, the variable gender is a subfactor of individual, so Y ~ gender * treatment + Error(individual/treatment), the gender effect is estimated in the individual stratum, whereas treatment and gender:treatment are estimated in the individual:treatment stratum.

It should be noted that it is very hard to interpret the results of aov() unless the Error() part of the model corresponds to a balanced experimental design. Or put more sharply: The model implied by the decomposition into error strata becomes nonsensical otherwise. If you do have a balanced design, the error strata reduce to simple combinations of means and observation, so the aov() algorithm is quite inefficient, but to my knowledge nobody has bothered to try and do better.

-pd

> On 13 Jul 2016, at 18:18 , Justin Thong <justinthong93 at gmail.com> wrote:
> 
> Hi
> 
> *I have been looking for a reference to explain how R uses the aov
> command(at a deeper level)*. More specifically, how R reads the formulae
> and R computes the sums of squares. I am not interested in understanding
> what the difference of Type 1,2,3 sum of squares are. I am more interested
> in finding out about how R computes ~x1:x2:x3  or how R computes ~A:x1
> emphasizing sequential nature of the way it computes, and models even more
> complicated than this.
> 
> Yours sincerely,
> Justin
> 
> *I check my email at 9AM and 4PM everyday*
> *If you have an EMERGENCY, contact me at +447938674419(UK) or
> +60125056192(Malaysia)*
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com