[R] Permutation or Bootstrap to obtain p-value for one sample

Tim Hesterberg timhesterberg at gmail.com
Sun Oct 9 17:23:35 CEST 2011


I'll concur with Peter Dalgaard that 
* a permutation test is the right thing to do - your problem is equivalent
  to a two-sample test,
* don't bootstrap, and
* don't bother with t-statistics
but I'll elaborate a bit on why, including 
* two approaches to the whole problem - and how your approach relates
  to the usual approach,
* an interesting tidbit about resampling t statistics.

First, I'm assuming that your x variable is irrelevant, only y matters,
and that sample1 is a proper subset of df.  I'll also assume that you
want to look for differences in the mean, rather than arbitrary differences
(in which case you might use e.g. a Kolmogorov-Smirnov test).

There are two general approaches to this problem:
(1) two-sample problem, sample1$y vs df$y[rows other than sample 1]
(2) the approach you outlined, thinking of sample1$y as part of df$y.

Consider (1), and call the two data sets y1 and y2.
The basic permutation test approach (sketched in R just after this outline) is
* compute the test statistic theta(y1, y2), e.g. mean(y1)-mean(y2)
* repeat 9999 (or 99999) times:
  draw a sample of size n1 from the pooled data, call that Y1, call the rest Y2
  compute theta(Y1, Y2)
* P-value for a one-sided test is (1 + k) / (1 + 9999)
  where k is the number of permutation samples with theta(Y1,Y2) >= theta(y1,y2)
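
Here's a minimal sketch of that recipe in R (untested, and the names are mine
rather than from your code; I use the mean difference as the statistic):

perm.test <- function(y1, y2, R = 9999) {
  theta.obs <- mean(y1) - mean(y2)           # theta(y1, y2)
  pooled <- c(y1, y2)
  n1 <- length(y1)
  theta.perm <- replicate(R, {
    idx <- sample(length(pooled), n1)        # draw Y1 without replacement
    mean(pooled[idx]) - mean(pooled[-idx])   # theta(Y1, Y2)
  })
  (1 + sum(theta.perm >= theta.obs)) / (1 + R)   # one-sided P-value
}
# e.g. perm.test(sample1$y, df$y[-sample1.rows]), where sample1.rows is
# however you identify the rows of sample1 within df (a hypothetical name).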

The test statistic could be
  mean(y1) - mean(y2)
  mean(y1)
  sum(y1)
  t-statistic (pooled variance)
  P-value for a t-test (pooled variance)
  mean(y1) - mean(pooled data)
  t-statistic (unpooled variance)
  P-value for a t-test (unpooled variance)
  median(y1) - median(y2)
  ...
The first six of those are equivalent - they give exactly the same P-value
for the permutation test.  The reason is that those test statistics
are monotone transformations of each other, given the data.
Hence, doing the pooled-variance t calculations gains nothing.
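
If you want to check that equivalence numerically, here is one way (a sketch
with made-up data; the same permutations are scored with the mean difference
and with the pooled-variance t statistic, and the P-values come out identical):

set.seed(1)
y1 <- rnorm(10); y2 <- rnorm(15, mean = 0.5)
pooled <- c(y1, y2); n1 <- length(y1)
perm.idx <- replicate(9999, sample(length(pooled), n1))  # one column per permutation
d  <- apply(perm.idx, 2, function(i) mean(pooled[i]) - mean(pooled[-i]))
tt <- apply(perm.idx, 2, function(i)
        t.test(pooled[i], pooled[-i], var.equal = TRUE)$statistic)
p.d <- (1 + sum(d  >= mean(y1) - mean(y2))) / (1 + 9999)
p.t <- (1 + sum(tt >= t.test(y1, y2, var.equal = TRUE)$statistic)) / (1 + 9999)
p.d; p.t   # the two P-values agree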

Now consider your approach (2).  That is equivalent to the permutation
test using the test statistic:  mean(y1) - mean(pooled data).
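
In that notation, your approach would look something like this (untested; since
mean(pooled) is the same in every draw, this orders the permutations exactly as
mean(Y1) alone does, which is what Peter's replicate() line below computes):

pooled <- df$y              # sample1 is part of df, so this is the pooled data
n1 <- nrow(sample1)
theta.obs  <- mean(sample1$y) - mean(pooled)
theta.perm <- replicate(9999,
  mean(pooled[sample(length(pooled), n1)]) - mean(pooled))
(1 + sum(theta.perm >= theta.obs)) / (1 + 9999)   # one-sided, as above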

Why not a bootstrap?  E.g. pool the data and draw samples of size
n1 and n2 from the pooled data, independently and with replacement.
That is similar to the permutation test, but less accurate.  Probably
the easiest way to see this is to suppose there is 1 outlier in the pooled data.
In any permutation iteration there is exactly 1 outlier among the two samples.
With bootstrapping, there could be 0, 1, 2, ....
The permutation test answers the question: given that there is exactly
1 outlier in my combined data, what is the probability that random chance
would give a difference as large as the one I observed?  The bootstrap would
answer some other question.
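
For concreteness, the bootstrap version I'm advising against would look
something like this (again just a sketch, reusing y1, y2, pooled, and n1 from
the two-sample sketch above):

n2 <- length(y2)
theta.obs  <- mean(y1) - mean(y2)
theta.boot <- replicate(9999, {
  Y1 <- sample(pooled, n1, replace = TRUE)   # with replacement, so an outlier can
  Y2 <- sample(pooled, n2, replace = TRUE)   # appear 0, 1, 2, ... times in total
  mean(Y1) - mean(Y2)
})
(1 + sum(theta.boot >= theta.obs)) / (1 + 9999)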

Tim Hesterberg
NEW!  Mathematical Statistics with Resampling and R, Chihara & Hesterberg
http://www.amazon.com/Mathematical-Statistics-Resampling-Laura-Chihara/dp/1118029852/ref=sr_1_1?ie=UTF8
http://home.comcast.net/~timhesterberg
 (resampling, water bottle rockets, computers to Guatemala, shower = 2650 light bulbs, ...)


>On Oct 8, 2011, at 16:04 , francy wrote:
>
>> Hi,
>>
>> I am having trouble understanding how to approach a simulation:
>>
>> I have a sample of n=250 from a population of N=2,000 individuals, and I
>> would like to use either permutation test or bootstrap to test whether this
>> particular sample is significantly different from the values of any other
>> random samples of the same population. I thought I needed to take random
>> samples (but I am not sure how many simulations I need to do) of n=250 from
>> the N=2,000 population and maybe do a one-sample t-test to compare the mean
>> score of all the simulated samples, + the one sample I am trying to prove
>> that is different from any others, to the mean value of the population. But
>> I don't know:
>> (1) whether this one-sample t-test would be the right way to do it, and how
>> to go about doing this in R
>> (2) whether a permutation test or bootstrap methods are more appropriate
>>
>> This is the data frame that I have, which is to be sampled:
>> df<-
>> i.e.
>> x y
>> 1 2
>> 3 4
>> 5 6
>> 7 8
>> . .
>> . .
>> . .
>> 2,000
>>
>> I have this sample from df, and would like to test whether it has extreme
>> values of y.
>> sample1<-
>> i.e.
>> x y
>> 3 4
>> 7 8
>> . .
>> . .
>> . .
>> 250
>>
>> For now I only have this:
>>
>> R=999 #Number of simulations, but I don't know how many...
>> t.values =numeric(R)	 #creates a numeric vector with 999 elements, which
>> will hold the results of each simulation.
>> for (i in 1:R) {
>> sample1 <- df[sample(nrow(df), 250, replace=TRUE),]
>>
>> But I don't know how to continue the loop: do I calculate the mean for each
>> simulation and compare it to the population mean?
>> Any help you could give me would be very appreciated,
>> Thank you.
>
>The straightforward way would be a permutation test, something like this
>
>msamp <- mean(sample1$y)
>mpop <- mean(df$y)
>msim <- replicate(10000, mean(sample(df$y, 250)))
>
>sum(abs(msim-mpop) >= abs(msamp-mpop))/10000
>
>I don't really see a reason to do bootstrapping here. You say you want to test whether your sample could be a random sample from the population, so just simulate that sampling (which should be without replacement, like your sample is). Bootstrapping might come in if you want a confidence interval for the mean difference between your sample and the rest.
>
>Instead of sampling means, you could put a full-blown t-test inside the replicate expression, like:
>
>psim <- replicate(10000, {s<-sample(1:2000, 250); t.test(df$y[s], df$y[-s])$p.value})
>
>and then check whether the p value for your sample is small compared to the distribution of values in psim.
>
>That'll take quite a bit longer, though; t.test() is a more complex beast than mean(). It is not obvious that it has any benefits either, unless you specifically wanted to investigate the behavior of the t test.
>
>(All code untested. Caveat emptor.)
>
>
>--
>Peter Dalgaard, Professor,
>Center for Statistics, Copenhagen Business School
>Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>Phone: (+45)38153501
>Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


