[Rd] Bugs/issues with model.tables() (PR#8275)

Wed Nov 2 07:44:03 CET 2005

Based on what follows, the most favourable construction is that
the documentation, by failing to say that the function should be
used only for completely balanced designs, is deficient.  Even
for such designs, there is a bug that needs correction. The
discussion is lengthy because I think it important to document,
at least wrt effects and means, exactly what model.tables()
does.

My preference is to pension model.tables() off, and to replace
it with the function that I describe (but do not, at this point,
reproduce) below.  Basically, for getting the means or effects,
my function replaces the call to proj() with a call to
predict.lm(.., type="terms"). The SEs for effects can also be
obtained using predict.lm().  For getting SEs of differences of
treatment means, the SEs that appear from summary.lm() can be
used.  Such a function will apply quite generally to models that
include only main effects.  This is surely a matter for
discussion and comment, especially as model.tables() has been
around since nearly the beginning of R time.

Summary
~~~~~~~
Under "Description:" there is the claim:
      Computes summary tables for model fits, especially complex 'aov'
      fits.

The warning that appears some way further down is perhaps intended
to weaken the force of this:

      The implementation is incomplete, and only the simpler cases have
      been tested thoroughly.

The implementation is useful only for balanced complete designs.
The documentation should say this.

In (I), I summarize issues with model.tables()
In (II), I give examples, which can be extended to verify points
that the output below does not specifically illustrate.

I (a) Even for balanced designs, the SEs for the effects are wrong.
I give an example below.

I (b) model.tables() works from a sequential breakdown of the
predicted values. In a model with factors A and B, with A fitted
first, model.tables() calculates effects for A, unadjusted for B,
then effects for B after adjusting for A.  If every combination
of A and B occurs equally often (a completely balanced design),
the order makes no difference.  If the design is not completely
balanced, then the order does matter, and the "effects" will not
be unbiased estimates of the treatment effects.  Nor will their
differences be unbiased estimates of the treatment differences.
This is true even for balanced incomplete block designs.

Means are obtained by adding "effects", calculated as just described,
to the overall mean.

The only hint on the help page that so-called "effects" and
"means" relate to such a sequential breakdown of predicted values
is the reference, under "See also", to proj. Under help(proj), it
is written:

      Projection matrices from the default
      method have orthogonal columns representing the projection of the
      response onto the column space of the Q matrix from the QR
      decomposition.

For those who know about QR, this immediatly flags a sequential
breakdown of the fitted values.

I (c) Standard errors for effects seeem to have in mind a formula
that (with a correction that will be noted below) applies only in
the completely balanced case.

I (d) Standard errors for differences of means (SEDs) seem to be
correct for completely balanced designs.  For BIB designs they
are independent of the order of terms, even though the "effects"
and "means" are not. They are in general wrong, even for the
"means" as given.

A correct order independent breakdown of the fitted values can be
obtained by replacing, in model.tables(), the call to proj() with
a call to predict.lm(.., type="terms").  [This requires a bit
more than a simple replacement.]  The means and effects are,
quite generally, the means and expects that most users will want.
Once I have worked over the code with some care, and sorted out
the caculation of SEs, I will offer this function for inclusion
in R's stats package.  Or I am happy to provide the current
draft of my function to anyone who wants to look at it.

II: Further Details, and Examples:

(a) For calculations of SEs for effects, model.tables.aov() calls
se.aov(), which uses the formula:
   s/sqrt(n)  [the mean is over n subunits]

The formula required, in the completely balanced case, is
   sqrt((m-1)/m)*s/sqrt(n) formula
[an effect is a mean over n subunits, minus the grand mean;
the grand mean is a mean over mn units.]

Often, m is large enough that the difference is of no consequence.
Also, the SEs of effects are commonly of little more than academic
interest.  But, if given at all, they should be correct.

Example
"bal1" <-
   structure(list(y = c(3.1, 2.4, 2.2, 2.3, 1.9, 0.9, 1.8, 1.1),
                  trt = structure(as.integer(c(1, 1, 2, 2, 3, 3, 4, 4)),
                    .Label = c("a", "b", "c", "d"), class = "factor")),
             .Names = c("y", "trt"),
             row.names = c("1", "2", "3", "4", "5", "6", "7", "8"),
             class = "data.frame")
 > model.tables(bal1.aov, type="effects", se=TRUE)$se
       trt
0.3526684
 > summary.lm(u)$sigma/sqrt(2)*sqrt(3/4)  # m=4
[1] 0.3054198
"
 > unique(predict.lm(bal1.aov, type="terms", se=TRUE)$se)
         trt
1 0.3054198

II (d) In the interests of brevity (sic!), I will limit attention
to means:

"bdes" <-
   structure(list(trt = structure(as.integer(c(1, 2, 1, 3, 1, 4,
                    2, 3, 2, 4, 3, 4)), .Label = c("a", "b", "c", "d"),
                    class = "factor"),
                  blk = structure(as.integer(c(1, 1, 2, 2, 3, 3, 4,  
4, 5, 5,
                    6, 6)), .Label = c("A", "B", "C", "D", "E", "F"),
                    class = "factor"),
                  y = c(0.8, -1.1, 4.5, 3.3, 4.3, 4.9, 0.6, 3.9, 4.6,  
9.4,
                    3.7, 5.7)), .Names = c("trt", "blk", "y"),
             row.names = c("1","2", "3", "4", "5", "6", "7", "8",  
"9", "10",
               "11", "12"), class = "data.frame")
 > # Crude block means; these are meaningless
 > tapply(bdes$y, bdes$blk, mean)
     A     B     C     D     E     F
-0.15  3.90  4.60  2.25  7.00  4.70
 > # Crude treatment means
 > tapply(bdes$y, bdes$trt, mean)
        a        b        c        d
3.200000 1.366667 3.633333 6.666667
 > ## aov fits
 > bdes.aov <- aov(y~blk+trt, data=bdes)  # Blocks first
 > bdes_trtFirst.aov <- aov(y~trt+blk, data=bdes) # trt first
 > model.tables(bdes.aov, type="means")[["table"]][["blk"]]
blk
     A     B     C     D     E     F
-0.15  3.90  4.60  2.25  7.00  4.70
 >   # Observe that these agree with the above crude block means
 > model.tables(bdes_trtFirst.aov, type="means")[["table"]][["trt"]]
trt
        a        b        c        d
3.200000 1.366667 3.633333 6.666667
 >
 > ## Treatment means, when blocks are taken first
 > model.tables(bdes.aov, type="means")[["table"]][["trt"]]
trt
        a        b        c        d
4.133333 2.050000 3.733333 4.950000
 > ## Also, note their differences
 > diff(model.tables(bdes.aov, type="means")[["table"]][["trt"]])
trt
         b         c         d
-2.083333  1.683333  1.216667

 > ## Treatment means, from the "usual" least squares analysis
 > dummy.coef(bdes.aov)[["(Intercept)"]]
(Intercept)
      1.4125
 > dummy.coef(bdes.aov)[["(Intercept)"]]+mean(dummy.coef(bdes.aov)$blk)+
+ dummy.coef(bdes.aov)$trt
        a        b        c        d
4.341667 1.216667 3.741667 5.566667
 > diff(dummy.coef(bdes.aov)$trt)
      b      c      d
-3.125  2.525  1.825
 > diff(model.tables(bdes.aov, type="means")[["table"]][["trt"]])/
+ diff(dummy.coef(bdes.aov)$trt)
trt
         b         c         d
0.6666667 0.6666667 0.6666667
 >  # This factor is some simple function of the BIB design parameters,
 >  # which I am too lazy or too busy to work out.

--please do not edit the information below--

Version:
platform = powerpc-apple-darwin7.9.0
arch = powerpc
os = darwin7.9.0
system = powerpc, darwin7.9.0
status =
major = 2
minor = 2.0
year = 2005
month = 10
day = 06
svn rev = 35749
language = R

Locale:
C

Search Path:
.GlobalEnv, cuckoohosts, file:~/r/ch2/.RData, file:../.RData,  
package:methods, package:stats, package:graphics, package:grDevices,  
package:utils, package:datasets, Autoloads, package:base

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.