[R] the effect of blocking on the size of confidence intervals - analysis using aov

Lars Kunert lkunert at mpi-sb.mpg.de
Mon Mar 16 23:24:10 CET 2009


Hi, I am a method developer in drug discovery.
I have developed a similarity searching method for small (drug-like)
molecules, and I want to compare its performance with that of other
methods.

Similarity searching methods are commonly assessed by their ability to
(re)discover a set of molecules that is active against a given target,
given one randomly selected query molecule.

Thus my dataset comprises
a quantitative response variable, namely "response", and
three categorical explanatory variables, namely "method", "target", and
"query".
The method variable has 4 levels, namely XXX.pg.trained,
XXX.rpg.trained, MACCS, and FTREES.
The target variable has 28 levels.
The query variable has 84 levels.
For every target, three queries are selected. Thus "query" is nested in
"target".
The design is completely balanced.

"response" "method" "target" "query"
0.68   "XXX.pg.trained" "06243" "06243:218379"
0.96   "XXX.pg.trained" "06243" "06243:218384"
0.72   "XXX.pg.trained" "06243" "06243:246192"
0.7696 "XXX.pg.trained" "06245" "06245:235027"
...

For the complete dataset see the attached file data.txt.
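
For reference, the data frame "d" used below is read in as follows
(this assumes the column layout shown above, with a header row):

d <- read.table("data.txt", header = TRUE)
d$method <- factor(d$method)
d$target <- factor(d$target)
d$query  <- factor(d$query)
str(d)   ## 336 rows: 4 methods x 28 targets x 3 queries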

Because I am interested in comparing the actual levels of "method", I
model "method" as a fixed factor.
Because I consider the levels of both "target" and "query" to be
randomly chosen samples from the overall distributions of targets and
queries, I model "target" and "query" as random factors.
An analysis of variance shows that both
(1) the between-target variance, and
(2) the within-target, between-query variance
contribute substantially to the overall variance.
Differences between the levels of "method" should therefore stand out
more clearly after blocking on "target" or "query".

> m.aov = aov(formula = response ~ method + Error(target/query), data = d)
> summary( m.aov )

Error: target
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals 27 18.8539  0.6983

Error: target:query
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals 56 3.15927 0.05642

Error: Within
           Df Sum Sq Mean Sq F value   Pr(>F)
method      3 1.0891  0.3630  13.062 5.97e-08 ***
Residuals 249 6.9201  0.0278
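
As a side calculation, the printed mean squares can be converted into
rough variance components by hand, assuming the usual
expected-mean-square relations for this balanced layout (one
observation per method per query, hence 4 observations per query and
12 per target):

ms.target <- 0.6983   ## Error: target stratum
ms.query  <- 0.05642  ## Error: target:query stratum
ms.within <- 0.0278   ## Within stratum
var.within <- ms.within                      ## ~0.028
var.query  <- (ms.query  - ms.within) / 4    ## ~0.007
var.target <- (ms.target - ms.query)  / 12   ## ~0.053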


I analyze three models:
> m0.aov = aov(formula = response ~ method -1, data = d)
> m1.aov = aov(formula = response ~ method -1 + Error(target), data = d)
> m2.aov = aov(formula = response ~ method -1 + Error(query), data = d)

I understand that the fitted parameters for the different levels of
"method" should be identical for all three models. However, the
confidence intervals of the parameters should shrink in the order m0,
m1, m2.
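
A back-of-the-envelope calculation (my own arithmetic from the mean
squares above) suggests how much the intervals should narrow: each
method level has 84 observations, so the standard error of a method
mean is sqrt(MS_resid / 84).

n <- 84              ## observations per method (28 targets x 3 queries)
sqrt(0.2952^2 / n)   ## ~0.0322 without blocking (residual SE of m0.aov)
sqrt(0.0278 / n)     ## ~0.0182 using the Within mean square after blocking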

In Maindonald's book "Data Analysis and Graphics Using R" I found a
description of how confidence intervals for fitted parameters can be
derived from the corresponding standard errors (p. 154).

For the m0.aov model the approach is straightforward:

> summary.lm(m0.aov)

...
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
methodFTREES           0.34163    0.03221   10.61   <2e-16 ***
methodMACCS            0.45525    0.03221   14.13   <2e-16 ***
methodXXX.pg.trained   0.49498    0.03221   15.37   <2e-16 ***
methodXXX.rpg.trained  0.45049    0.03221   13.99   <2e-16 ***
...
Residual standard error: 0.2952 on 332 degrees of freedom
...

Thus the 95% confidence interval for "methodFTREES" is given by:

0.34163 +- qt( 0.975, 332 )*0.03221 = ( 0.2782686, 0.4049914 )
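
The same interval can be computed directly in R from the numbers
printed above:

est <- 0.34163
se  <- 0.03221
est + c(-1, 1) * qt(0.975, df = 332) * se
## [1] 0.2782686 0.4049914

(confint(m0.aov) should give the same intervals without the hand
calculation.)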

However, for m1.aov and m2.aov the call to summary.lm fails:

> summary.lm(m1.aov)

Error in if (p == 0) { : argument is of length zero

And, even more disturbingly, the fixed parameters of both models
deviate from the fixed parameters of the m0.aov model:

> coef( m1.aov )

target :
methodFTREES
    1.742346

Within :
        methodFTREES          methodMACCS methodXXX.pg.trained
        -0.108865476          0.004753571          0.044491667

Note that I added "-1" to the model formula in order to exclude the
intercept term and obtain four directly comparable parameters. With
the intercept term I get the following result:


> coef(m1.aov)
(Intercept) :
(Intercept)
  0.4355866

target :
numeric(0)

Within :
          methodMACCS  methodXXX.pg.trained methodXXX.rpg.trained
            0.1136190             0.1533571             0.1088655

I have the following questions:
- Is the overall approach valid: using blocking factors to reduce the
residual variance, and hence to decrease the width of the confidence
intervals?
- Why does the call to summary.lm fail for m1.aov and m2.aov?
- Why do the fitted parameters of m0.aov differ from those of m1.aov
and m2.aov?

Please note that I made similar attempts using lme and lmer and failed
in both cases. To keep this message short, I will post my questions
concerning lme and lmer in a follow-up mail.


Thanks for your help,

Lars


-------------- next part --------------
Name: data.txt
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090316/7b7d5845/attachment-0003.txt>
