# [R] Bootstrapping Confidence Intervals for Medians

Prof Brian Ripley ripley at stats.ox.ac.uk
Sat Jan 6 09:59:12 CET 2007

```On Sat, 6 Jan 2007, gilbertg at musc.edu wrote:

> I apologize for this post. I am new to R (two days) and I have tried and
> tried to calculated confidence intervals for medians. Can someone help
> me?

Later, you say you want a confidence interval for a difference in medians,
not the same thing.

For medians, see MASS4 section 5.7 for worked examples and discussion of
the pitfalls.

> Here is my data:
>
> institution1
> 0.21
> 0.16
> 0.32
> 0.69
> 1.15
> 0.9
> 0.87
> 0.87
> 0.73
>
> The first four observations compose group 1 and observations 5 through 9
> compose group 2. I would like to create a bootstrapped 90% confidence
> interval on the difference of the medians (n2-n1). I have successfully
> calculated a permutation test.
>
> This shouldn't be as difficult as I am making it, would someone please
> enlighten me?

It seems to me to be much more difficult than you have made it.  We need
to know exactly what you mean by

> a bootstrapped 90% confidence interval on the difference of the medians

The 'standard' theory of bootstrap confidence intervals as implemented in
e.g. package 'boot' is for a single-sample problem (and it would be
pushing its justification very hard to use this for n=9).  But you have
two samples, and haven't told us how you intend to bootstrap.  I guess you
mean a stratified bootstrap, sampling with replacement independently from
observations 1-4 and 5-9.  I don't know of theory for bootstrap confidence
intervals from that scenario: do you?

Beyond this, there are considerable problems with bootstrapping medians in
small samples as the median is a non-smooth function of the data and the
bootstrap samples take very few values.  See for example the galaxies
dataset as discussed in MASS4.  For the stratified bootstrapping I
referred to, there are only a handful of possible values of each of the
medians and so the bootstrap distribution is a highly non-uniform one on a
few values.  E.g.

x <- c(0.21, 0.16, 0.32, 0.69)
y <- c(1.15, 0.9, 0.87, 0.87, 0.73)
z <- numeric(10000)
for(i in seq_len(10000))
z[i] <- median(sample(x, replace=TRUE)) - median(sample(y, replace=TRUE))

-0.99 -0.965  -0.94  -0.91 -0.885  -0.83  -0.74 -0.725 -0.715  -0.71
33     70     83     27    134     64    129     16    259    317
-0.7  -0.69 -0.685  -0.66 -0.645 -0.635  -0.63 -0.605  -0.58  -0.57
43    370    711   1064     70    538    455   1388    424     29
-0.55 -0.545  -0.52  -0.49 -0.475 -0.465  -0.46  -0.45 -0.445  -0.42
905     57     79     41     54    119     28    183    146    436
-0.41 -0.395 -0.365 -0.305  -0.28 -0.225  -0.21  -0.18  -0.04
100    290    759     13     34     64    117    323     28

You could use that table to give 'basic' or 'percentile' confidence
intervals, if you have reason to believe in them.

> Greg Gilbert, Faculty Research Associate
> Department of Biostatistics, Bioinformatics, & Epidemiology
> Medical University of South Carolina

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

```