[R] Bootstrapping Confidence Intervals for Medians
Prof Brian Ripley
ripley at stats.ox.ac.uk
Sat Jan 6 09:59:12 CET 2007
On Sat, 6 Jan 2007, gilbertg at musc.edu wrote:
> I apologize for this post. I am new to R (two days) and I have tried and
> tried to calculated confidence intervals for medians. Can someone help
> me?
Later, you say you want a confidence interval for a difference in medians,
not the same thing.
For medians, see MASS4 section 5.7 for worked examples and discussion of
the pitfalls.
> Here is my data:
>
> institution1
> 0.21
> 0.16
> 0.32
> 0.69
> 1.15
> 0.9
> 0.87
> 0.87
> 0.73
>
> The first four observations compose group 1 and observations 5 through 9
> compose group 2. I would like to create a bootstrapped 90% confidence
> interval on the difference of the medians (n2-n1). I have successfully
> calculated a permutation test.
>
> This shouldn't be as difficult as I am making it, would someone please
> enlighten me?
It seems to me to be much more difficult than you have made it. We need
to know exactly what you mean by
> a bootstrapped 90% confidence interval on the difference of the medians
The 'standard' theory of bootstrap confidence intervals as implemented in
e.g. package 'boot' is for a single-sample problem (and it would be
pushing its justification very hard to use this for n=9). But you have
two samples, and haven't told us how you intend to bootstrap. I guess you
mean a stratified bootstrap, sampling with replacement independently from
observations 1-4 and 5-9. I don't know of theory for bootstrap confidence
intervals from that scenario: do you?
Beyond this, there are considerable problems with bootstrapping medians in
small samples as the median is a non-smooth function of the data and the
bootstrap samples take very few values. See for example the galaxies
dataset as discussed in MASS4. For the stratified bootstrapping I
referred to, there are only a handful of possible values of each of the
medians and so the bootstrap distribution is a highly non-uniform one on a
few values. E.g.
x <- c(0.21, 0.16, 0.32, 0.69)
y <- c(1.15, 0.9, 0.87, 0.87, 0.73)
z <- numeric(10000)
for(i in seq_len(10000))
z[i] <- median(sample(x, replace=TRUE)) - median(sample(y, replace=TRUE))
-0.99 -0.965 -0.94 -0.91 -0.885 -0.83 -0.74 -0.725 -0.715 -0.71
33 70 83 27 134 64 129 16 259 317
-0.7 -0.69 -0.685 -0.66 -0.645 -0.635 -0.63 -0.605 -0.58 -0.57
43 370 711 1064 70 538 455 1388 424 29
-0.55 -0.545 -0.52 -0.49 -0.475 -0.465 -0.46 -0.45 -0.445 -0.42
905 57 79 41 54 119 28 183 146 436
-0.41 -0.395 -0.365 -0.305 -0.28 -0.225 -0.21 -0.18 -0.04
100 290 759 13 34 64 117 323 28
You could use that table to give 'basic' or 'percentile' confidence
intervals, if you have reason to believe in them.
> Greg Gilbert, Faculty Research Associate
> Department of Biostatistics, Bioinformatics, & Epidemiology
> Medical University of South Carolina
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
More information about the R-help
mailing list