[R] Bootstrap inference for the sample median?

(Ted Harding) Ted.Harding at manchester.ac.uk
Mon Aug 31 13:05:44 CEST 2009


On 30-Aug-09 16:24:08, Emmanuel Charpentier wrote:
> On Sunday 30 August 2009 at 18:43 +0530, Ajay Shah wrote:
>> Folks,
>> I have this code fragment:
>>   library(boot)   # boot() and boot.ci() come from the 'boot' package
>>   set.seed(1001)
>>   x <- c(0.79211363702017, 0.940536712079832, 0.859757602692931,
>>          0.82529998629531, 0.973451006822,    0.92378802164835,
>>          0.996679563355802,0.943347739494445, 0.992873542980045,
>>          0.870624707845108,0.935917364493788)
>>   range(x)
>>   # from 0.79 to 0.996
>> 
>>   e <- function(x,d) {
>>     median(x[d])
>>   }
>> 
>>   b <- boot(x, e, R=1000)
>>   boot.ci(b)
>> 
>> The 95% confidence interval, as seen with `Normal' and `Basic'
>> calculations, has an upper bound of 1.0028 and 1.0121.
>> 
>> How is this possible? If I sample with replacement from values which
>> are all lower than 1, then any sample median of these bootstrap
>> samples should be lower than 1. The upper cutoff of the 95% confidence
>> interval should also be below 1.
> 
> Nope. "Normal" and "Basic" fit (some form of) a normal distribution
> to the bootstrap sample of your statistic. But the median of such a
> small sample is quite far from normal (hint: it is a discrete
> distribution with only very few possible values, at most as many
> values as there are in the sample. Exercise: derive the distribution
> of median(x)...).
> 
> To convince yourself, look at the histogram of the bootstrap
> distribution of median(x). Contrast with the bootstrap distribution of
> mean(x). Meditate. Conclude...
> HTH,
>                                       Emmanuel Charpentier
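Emmanuel's experiment can be sketched in a few lines of R (a sketch,
assuming the 'boot' package and the x vector from the original post):

```r
# Compare the bootstrap distributions of the median and the mean.
library(boot)
set.seed(1001)
x <- c(0.79211363702017, 0.940536712079832, 0.859757602692931,
       0.82529998629531, 0.973451006822,    0.92378802164835,
       0.996679563355802,0.943347739494445, 0.992873542980045,
       0.870624707845108,0.935917364493788)
b.med  <- boot(x, function(x, d) median(x[d]), R = 1000)
b.mean <- boot(x, function(x, d) mean(x[d]),   R = 1000)

# For an odd-sized sample, every resample median is one of the
# original x values, so the medians take at most 11 distinct values;
# the means are nearly continuous by comparison.
length(unique(b.med$t))    # small: at most 11 distinct values
length(unique(b.mean$t))   # far larger
hist(b.med$t);  hist(b.mean$t)
```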

Ajay,
You have not said why you are adopting the bootstrap approach.
Maybe you specifically want to study the behaviour of bootstrap
methods, in which case my response below is irrelevant. But if,
on the other hand, you are simply looking for a confidence interval
for the median, you might consider a non-parametric approach.

This -- which does not depend on distributional assumptions about
the data -- is based on the fact that if "med" denotes the median
of the distribution of a continuous variable X,

  Prob(X < med) = Prob(X > med) = 1/2

Hence, if X[1], X[2], ... , X[n] denote the values of a random
sample of X-values in increasing order, then Prob(X[1] > med) = 1/2^n,
and for r > 1:

  Prob( (X[r]>med) )
  = Prob( (X[1]>med) ) +
    Prob( (X[1]<med)&(X[2]>med) ) + ... +
    Prob( (X[1]<med)&(X[2]<med)&...&(X[r-1]<med)&(X[r]>med) )

i.e. terms summed over all the r disjoint ways in which "X[r]>med"
can occur. These terms are all straightforward binomial probabilities,
with p = 1/2, so

  Prob( (X[r]>med) )
  = (1/2^n) + n*(1/2^n) + ... + choose(n,(r-1))*(1/2^n)
  = pbinom((r-1),n,1/2)
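
This identity is easy to check numerically, e.g. by simulation from a
distribution with a known median (the Exponential(1) below, with
median log(2), is an arbitrary choice of mine; only base R is used):

```r
# Simulation check of Prob(X[r] > med) = pbinom(r-1, n, 1/2),
# using Exponential(1), whose true median is log(2).
set.seed(42)
n <- 11; r <- 2
hits <- replicate(100000, sort(rexp(n))[r] > log(2))
c(simulated = mean(hits), exact = pbinom(r - 1, n, 1/2))
# the two values should agree to roughly three decimal places
```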

Hence if, for instance, you find the maximum value of r such that
for given n:

  Prob( X[r] > med ) <= 0.025

then the probability is at least 0.975 that "X[r] < med" whatever
the value of "med" may be. Hence the element of the sorted sample
in the r-th position is a lower 97.5% one-sided confidence limit
for "med". Because you are looking at the median, the situation
is symmetrical, i.e. Prob(X > med) = Prob(X < med) = 1/2, so a
corresponding 97.5% upper one-sided confidence limit for "med" is
the element of the sorted sample in the (n+1-r)-th position.
Hence the pair of these two limits is a 95% confidence interval
for the median.

Now, since your (admittedly small) sample size is n=11, you can
get at it as follows:

Ps <- pbinom(q=(0:11), size=11, prob=1/2)
round(rbind(0:11, Ps, rev(Ps)), 3)

  q       0     1     2     3     4     5     6     7     8     9    10    11
  Ps  0.000 0.006 0.033 0.113 0.274 0.500 0.726 0.887 0.967 0.994 1.000 1.000
  rev 1.000 1.000 0.994 0.967 0.887 0.726 0.500 0.274 0.113 0.033 0.006 0.000

Hence the maximum value of r for which Ps[r] = pbinom(r-1, 11, 1/2)
is at most 0.025 is r=2, so the second of the 11 sorted observations
is the lower limit, and the 10th is the upper limit, of a 95%
confidence interval. In the case of your data,

  x <- c(0.79211363702017, 0.940536712079832, 0.859757602692931,
         0.82529998629531, 0.973451006822,    0.92378802164835,
         0.996679563355802,0.943347739494445, 0.992873542980045,
         0.870624707845108,0.935917364493788)
  print(sort(x)[c(2,10)],15)
  # [1] 0.825299986295310 0.992873542980045

Note that this is a conservative confidence interval, in that it
is guaranteed that the confidence level

  Prob(LowerLimit < median < UpperLimit) >= 0.95

and it is the shortest such interval with symmetry. As it happens,
with n=11, this probability is

  1 - 2*pbinom(1,11,1/2) =  0.9882812

so it is very conservative (a consequence of small n).

You could get a less conservative interval by using an asymmetrical
interval, e.g. the 2nd and 9th, or the 3rd and 10th, when the
probability would be

  1 - pbinom(1,11,1/2) - pbinom(2,11,1/2) = 0.9614258

which is pretty close to the "target confidence level" while still
not falling below it; but then you need a reason for preferring one
([2,9]) to the other ([3,10])! Or you could choose one of them at
random ...
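
The whole recipe above can be packaged as a small helper (a sketch in
base R; the name median.ci is mine, not a standard function, and it
implements only the symmetric interval described above):

```r
# Distribution-free confidence interval for the median via order
# statistics; 'median.ci' is a hypothetical helper, not part of base R.
median.ci <- function(x, conf = 0.95) {
  n <- length(x)
  alpha <- (1 - conf) / 2
  # positions r with Prob(X[r] > med) = pbinom(r-1, n, 1/2) <= alpha
  ok <- which(pbinom(0:(n - 1), n, 1/2) <= alpha)
  if (length(ok) == 0) stop("sample too small for this confidence level")
  r <- max(ok)
  s <- sort(x)
  c(lower = s[r], upper = s[n + 1 - r],
    coverage = 1 - 2 * pbinom(r - 1, n, 1/2))
}

median.ci(c(0.79211363702017, 0.940536712079832, 0.859757602692931,
            0.82529998629531, 0.973451006822,    0.92378802164835,
            0.996679563355802,0.943347739494445, 0.992873542980045,
            0.870624707845108,0.935917364493788))
#     lower     upper  coverage
# 0.8253000 0.9928735 0.9882812
```

For the posted data this reproduces the [2nd, 10th] interval and its
actual coverage of about 0.988 derived above.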

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 31-Aug-09                                       Time: 12:05:40
------------------------------ XFMail ------------------------------



