[R] Bootstrapping Confidence Intervals for Medians

Prof Brian Ripley ripley at stats.ox.ac.uk
Sat Jan 6 09:59:12 CET 2007


On Sat, 6 Jan 2007, gilbertg at musc.edu wrote:

> I apologize for this post. I am new to R (two days) and I have tried and 
> tried to calculated confidence intervals for medians. Can someone help 
> me?

Later, you say you want a confidence interval for a difference in medians, 
not the same thing.

For medians, see MASS4 section 5.7 for worked examples and discussion of 
the pitfalls.

> Here is my data:
>
> institution1
> 0.21
> 0.16
> 0.32
> 0.69
> 1.15
> 0.9
> 0.87
> 0.87
> 0.73
>
> The first four observations compose group 1 and observations 5 through 9 
> compose group 2. I would like to create a bootstrapped 90% confidence 
> interval on the difference of the medians (n2-n1). I have successfully 
> calculated a permutation test.
>
> This shouldn't be as difficult as I am making it, would someone please 
> enlighten me?


It seems to me to be much more difficult than you have made it.  We need 
to know exactly what you mean by

> a bootstrapped 90% confidence interval on the difference of the medians

The 'standard' theory of bootstrap confidence intervals as implemented in 
e.g. package 'boot' is for a single-sample problem (and it would be 
pushing its justification very hard to use this for n=9).  But you have 
two samples, and haven't told us how you intend to bootstrap.  I guess you 
mean a stratified bootstrap, sampling with replacement independently from 
observations 1-4 and 5-9.  I don't know of theory for bootstrap confidence 
intervals from that scenario: do you?

Beyond this, there are considerable problems with bootstrapping medians in 
small samples as the median is a non-smooth function of the data and the 
bootstrap samples take very few values.  See for example the galaxies 
dataset as discussed in MASS4.  For the stratified bootstrapping I 
referred to, there are only a handful of possible values of each of the 
medians and so the bootstrap distribution is a highly non-uniform one on a 
few values.  E.g.

x <- c(0.21, 0.16, 0.32, 0.69)
y <- c(1.15, 0.9, 0.87, 0.87, 0.73)
z <- numeric(10000)
for(i in seq_len(10000))
z[i] <- median(sample(x, replace=TRUE)) - median(sample(y, replace=TRUE))

  -0.99 -0.965  -0.94  -0.91 -0.885  -0.83  -0.74 -0.725 -0.715  -0.71
     33     70     83     27    134     64    129     16    259    317
   -0.7  -0.69 -0.685  -0.66 -0.645 -0.635  -0.63 -0.605  -0.58  -0.57
     43    370    711   1064     70    538    455   1388    424     29
  -0.55 -0.545  -0.52  -0.49 -0.475 -0.465  -0.46  -0.45 -0.445  -0.42
    905     57     79     41     54    119     28    183    146    436
  -0.41 -0.395 -0.365 -0.305  -0.28 -0.225  -0.21  -0.18  -0.04
    100    290    759     13     34     64    117    323     28

You could use that table to give 'basic' or 'percentile' confidence 
intervals, if you have reason to believe in them.


> Greg Gilbert, Faculty Research Associate
> Department of Biostatistics, Bioinformatics, & Epidemiology
> Medical University of South Carolina

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



More information about the R-help mailing list