[R] bootstrap

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon Nov 12 16:33:32 CET 2007


On Mon, 12 Nov 2007, Stefano Ghirlanda wrote:

> i am using the boot package for some bootstrap calculations in place
> of anovas. one reason is that my dependent variable is distributed
> bimodally, but i would also like to learn about bootstrapping in
> general (i have ordered books but they have not yet arrived).
>
> i get the general idea of bootstrapping but sometimes i do not know
> how to define suitable statistics to test specific hypotheses.

That's a basic issue in statistics.  Bootstrapping is only another way to 
assess the variability of a pre-determined statistic (and incidentally it 
is not much used for testing; it is more often used for confidence 
intervals).
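
For illustration, a minimal sketch of the confidence-interval use with 
the 'boot' package, on made-up data (the sample mean is used here only 
as a simple statistic):

library(boot)
set.seed(1)
x <- rnorm(50)                             # toy sample
b <- boot(x, function(d, i) mean(d[i]), R = 999)
boot.ci(b, type = c("perc", "bca"))        # percentile and BCa intervals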

> two examples follow.
>
> 1) comparing the means of more than two groups. a suitable statistic
>   could be the sum of squared deviations of group means from the
>   grand mean. does this sound reasonable?

No.  That quantity means nothing by itself; it needs to be compared to 
the residual variation (e.g. by an F statistic).
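
A toy illustration (made-up data): the sum of squared deviations only 
becomes a test statistic once it is scaled against the residual 
variation.

set.seed(1)
g <- gl(3, 10)                        # three groups of 10
y <- rnorm(30)
m <- tapply(y, g, mean)               # group means
ssb <- sum(10 * (m - mean(y))^2)      # between-group sum of squares
ssw <- sum((y - m[g])^2)              # residual sum of squares
(ssb / 2) / (ssw / 27)                # the F statistic; cf. anova(lm(y ~ g))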

> 2) testing for interactions. e.g., i want to see whether an
>   independent variable has the same effect in two different
>   samples. in an anova this would be expressed as the significance,
>   or lack thereof, of the interaction between a "sample" factor and
>   another factor for the independent variable. how would i do this
>   with a bootstrap calculation?
>
> my problem with 2) is that when one fits a linear model to the data,
> from which sums of squares for the anova are calculated, the
> interaction between the two factors corresponds to many regression
> coefficients in the linear model (e.g., i actually have three samples
> and an independent variable with four levels). i do not know how to
> summarize these in a single statistic.

Any good book on statistics with R (e.g. MASS) would point you at the 
anova() function to compare models.
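
For instance (hypothetical data: response 'y', factors 'samp' and 
'treat'), anova() reduces the many interaction coefficients to a single 
F test:

set.seed(1)
dat <- data.frame(samp  = gl(3, 40),
                  treat = gl(4, 10, 120),
                  y     = rnorm(120))
fit.add <- lm(y ~ samp + treat, data = dat)   # main effects only
fit.int <- lm(y ~ samp * treat, data = dat)   # adds the interaction terms
anova(fit.add, fit.int)                       # one F test for all 6 terms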

> i have seen somewhere that some people calculate F ratios
> nevertheless, but then test them against a bootstrapped distribution
> rather than against the F distribution. is this a sensible approach?
> could one also use sums of squares directly as the bootstrapped
> statistics?

It is not totally off the wall.  The problems are

- How you bootstrap, given that you don't have a single homogeneous group.
   You seem to want to test, so you need to emulate the null-hypothesis
   distribution.  The most common way to do that is to fit a model, find
   some sort of residuals, bootstrap those, and use them as errors to
   simulate from the null hypothesis (a sketch follows after these
   points).  At that point you will have to work hard to convince many
   statisticians that you have improved over the standard theory or over
   a simulation from a long-tailed error distribution.

- That if you don't believe in a normal distribution for your errors
   (note: the errors, not the response), you probably should not be using
   least-squares-based statistical methodology.  And remember that the
   classical ANOVA tests are supported by permutation arguments, which
   are very similar to the bootstrap (just sampling without replacement
   instead of with replacement).
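
Here is a minimal sketch of the residual bootstrap described in the 
first point, on made-up one-way data (all names and data are only 
illustrative):

set.seed(1)
g <- gl(3, 20)                        # three groups of 20
y <- rnorm(60)
fit0 <- lm(y ~ 1)                     # null model: no group effect
Fobs <- anova(fit0, lm(y ~ g))$F[2]   # observed F statistic
B <- 999
Fstar <- numeric(B)
for (b in seq_len(B)) {
    ## resample the null-fit residuals with replacement to simulate
    ## responses under the null hypothesis
    ystar <- fitted(fit0) + sample(residuals(fit0), replace = TRUE)
    Fstar[b] <- anova(lm(ystar ~ 1), lm(ystar ~ g))$F[2]
}
(1 + sum(Fstar >= Fobs)) / (B + 1)    # bootstrap p-value
## the permutation analogue would use ystar <- sample(y) instead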

These points are discussed with examples in the linear models chapter of 
MASS (the book) and also in the Davison-Hinkley book which the 'boot' 
package supports.

[Shame about the broken shift key, although it seems to work with F: 
keyboards are really cheap to replace these days.]

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


