[R] Unbalanced data in split-plot analysis with aov()

Wed Oct 11 11:55:49 CEST 2017

> On 11 Oct 2017, at 00:07 , Samuel Knapp <samuel.knapp at tum.de> wrote:
> 
> Dear all,
> 
> I'm analysing a split-plot experiment, where there are sometimes one or 
> two values missing. I realized that if the data is slightly unbalanced, 
> the effect of the subplot-treatment will also appear and be tested 
> against the mainplot-error term.
> 
> I replicated this with the Oats dataset from Yates (1935), contained in 
> the nlme package, where Variety is on mainplot, and nitro on subplot.
> 
>> # Oats dataset (Yates 1935) from the nlme package
>> require(nlme); data <- get(data(Oats))
>> data$nitro <- factor(data$nitro);data$Block <- 
> as.factor(as.character(data$Block))
>> nrow(data) # 6 Blocks * 4 N-levels * 3 Varieties = 72 obs -> 
> orthogonal and balanced
> [1] 72
>> # split-plot anova
>> summary(aov(yield ~ Block+Variety*nitro + Error(Block/Variety),data))
> 
> Error: Block
>       Df Sum Sq Mean Sq
> Block  5  15875    3175
> 
> Error: Block:Variety
>           Df Sum Sq Mean Sq F value Pr(>F)
> Variety    2   1786   893.2   1.485  0.272
> Residuals 10   6013   601.3
> 
> Error: Within
>               Df Sum Sq Mean Sq F value   Pr(>F)
> nitro          3  20020    6674  37.686 2.46e-12 ***
> Variety:nitro  6    322      54   0.303    0.932
> Residuals     45   7969     177
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> reduceddata <- data[-5,] # delete one observation
>> summary(aov(yield ~ Block+Variety*nitro + 
> Error(Block/Variety),reduceddata))
> 
> Error: Block
>       Df Sum Sq Mean Sq
> Block  5  16070    3214
> 
> Error: Block:Variety
>           Df Sum Sq Mean Sq F value Pr(>F)
> Variety    2   1820   910.0   1.373  0.302
> nitro      1      1     1.0   0.001  0.970
> Residuals  9   5964   662.7
> 
> Error: Within
>               Df Sum Sq Mean Sq F value   Pr(>F)
> nitro          3  19766    6589   36.88 4.49e-12 ***
> Variety:nitro  6    333      55    0.31    0.928
> Residuals     44   7860     179
> 
> Although I fully understand, that mixed-models should be preferred for 
> non-orthogonal/unbalanced data, I think it's still valid to use the 
> aov() approach, when only 1 or 2 datapoints are missing. Also, I don't 
> really understand, why the structure of the ANOVA changes suddenly. Is 
> there any argument, I could supply to change this behaviour?

The answer to this is in the math, and not really possible to cover here. 

If things are non-orthogonal, you don't have the pretty separation of effects into error strata and one effect of that is that you can see the same effect occurring multiple time with different error terms. (In a plain block design, if not all treatments occur equally often in each block, the block means contain information about the treatment effects if block effects are considered random). 

Worse, if the balanced structure of the design is broken, the orthogonal decomposition of random effects does not correspond to the covariance structure that you think you are assuming. The consequences of this require considerable expertise to figure out. (I forget the details, but the nature is like having to assume variance of random effects to depend on block size.)

One known-valid way of dealing with a small number of missing values, is to retain the design, set the value of the missing values to something arbitrary, and include a dummy variable for each missing observation which is 1 for the position of that value and zero elsewhere. You then still get the effect of estimating the dummy coefficient in multiple strata corresponding to an extended model where the effect of the dummy is different for block means than for residuals. However, at least you then know fairly well what is going on. 

So, something like this (removing block from the fixed effects because you also had is as a random effect):

> i5 <- rep(0,72); i5[5] <- 1
> summary(aov(yield ~ i5 + nitro*Variety + Error(Block/Variety),data))

Error: Block
          Df Sum Sq Mean Sq F value  Pr(>F)   
i5         1  14163   14163   33.08 0.00453 **
Residuals  4   1713     428                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: Block:Variety
          Df Sum Sq Mean Sq F value Pr(>F)
i5         1     26    26.0   0.039  0.847
Variety    2   1809   904.7   1.365  0.304
Residuals  9   5964   662.7               

Error: Within
              Df Sum Sq Mean Sq F value   Pr(>F)    
i5             1    352     352   1.971    0.167    
nitro          3  19766    6589  36.885 4.49e-12 ***
nitro:Variety  6    333      55   0.310    0.928    
Residuals     44   7860     179                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

You will notice that in this analysis, you can modify the value for yield[5] at will, and still get the same output, except of course for the "i5" entries. Also, quite interestingly, this reproduces the analysis with the 5th obs removed, _provided_ that you put nitro before Variety in the model spec: 

> summary(aov(yield ~ nitro*Variety + Error(Block/Variety),data[-5,]))

Error: Block
          Df Sum Sq Mean Sq F value  Pr(>F)   
nitro      1  14357   14357   33.53 0.00442 **
Residuals  4   1713     428                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: Block:Variety
          Df Sum Sq Mean Sq F value Pr(>F)
nitro      1     11    11.5   0.017  0.898
Variety    2   1809   904.7   1.365  0.304
Residuals  9   5964   662.7               

Error: Within
              Df Sum Sq Mean Sq F value   Pr(>F)    
nitro          3  19766    6589   36.88 4.49e-12 ***
nitro:Variety  6    333      55    0.31    0.928    
Residuals     44   7860     179                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I suppose this means that the nitro term adjusts the analyses in the lower error strata in the same way as the missing value indicator, but I really can't explain the effect in detail.

-pd

> 
> When I do the same with lm() and subsequent anova(), and calculate 
> F-value for Variety by hand, the estimates are still quite robust.
> 
> Best regards,
> 
> Sam
> 
> 
> -- 
> Samuel Knapp
> 
> Lehrstuhl für Pflanzenernährung
> Technische Universität München
> (Chair of Plant Nutrition
> Technical University of Munich)
> 
> Emil-Ramann-Strasse 2
> D-85354 Freising
> 
> Tel. +49 8161 71-3578
> samuel.knapp at tum.de
> www.researchgate.net/profile/Samuel_Knapp
> 
> -- 
> Samuel Knapp
> 
> Lehrstuhl für Pflanzenernährung
> Technische Universität München
> (Chair of Plant Nutrition
> Technical University of Munich)
> 
> Emil-Ramann-Strasse 2
> D-85354 Freising
> 
> Tel. +49 8161 71-3578	
> samuel.knapp at tum.de
> www.researchgate.net/profile/Samuel_Knapp
> 
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com