[R] p-values from bootstrap - what am I not understanding?

Matthew Keller mckellercran at gmail.com
Mon Apr 13 01:40:59 CEST 2009


Hi Johan,

Interesting question. I'm (trying) to write a lecture on this as we
speak. I'm no expert, but here are my two cents.

I think that your method works fine WHEN the sampling distribution
doesn't change its variance or shape depending on where it's centered.
That is the case for normally, t-, or chi-square-distributed
statistics, which is why it's fine to do this in traditional
statistical methods. However, there are situations where it might not
hold (e.g., there may be a mean-variance relationship), and since we
would like a general method of getting valid p-values that doesn't
depend on strong assumptions, this probably isn't the way to go.
Permutation would seem to work better because you are simulating the
null process directly (a toy sketch is below). However, figuring out
how to permute the data in a way that creates the null you want while
retaining all the dependencies, missingness patterns, etc. in your
data can be difficult or impossible.
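
For instance, a bare-bones sketch of a permutation test for a
two-group difference in means (toy data and names of my own, not from
your problem):

set.seed(1)
x <- rnorm(30, mean = 0.5)       # group 1 (made-up data)
y <- rnorm(30)                   # group 2
obs <- mean(x) - mean(y)         # observed statistic
pooled <- c(x, y)
B <- 10000
perm <- replicate(B, {
    z <- sample(pooled)          # shuffle group labels: the null process
    mean(z[1:30]) - mean(z[-(1:30)])
})
p.perm <- (sum(abs(perm) >= abs(obs)) + 1) / (B + 1)  # two-sided p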

Hope that helps...

Matt


On Sun, Apr 12, 2009 at 4:38 PM, Peter Dalgaard
<p.dalgaard at biostat.ku.dk> wrote:
> Johan Jackson wrote:
>>
>> Dear stats experts:
>> Me and my little brain must be missing something regarding
>> bootstrapping. I understand how to get a 95% CI and how to do a
>> hypothesis test using bootstrapping (e.g., reject the null or not).
>> However, I'd also like to get a p-value from it, and to me this
>> seems simple, but it seems no one does what I would like to do to
>> get a p-value, which suggests I'm not understanding something.
>> Rather, it seems that when people want a p-value using resampling
>> methods, they immediately jump to permutation testing (e.g.,
>> destroying dependencies so as to create a null distribution). SO -
>> here's my thought on getting a p-value by bootstrapping. Could
>> someone tell me what is wrong with my approach? Thanks:
>>
>> STEPS TO GETTING P-VALUES FROM BOOTSTRAPPING - PROBABLY WRONG:
>>
>> 1) Sample B times with replacement and compute theta* (your
>> statistic of interest) each time. B is large (> 1000).
>>
>> 2) Get the distribution of theta*.
>>
>> 3) The mean of theta* is generally near your observed theta. In the
>> same way that we use non-centrality parameters in other situations,
>> shift the distribution of theta* so that it is centered at the
>> value corresponding to your null hypothesis (e.g., make the
>> distribution have mean theta = 0).
>>
>> 4) Two methods for finding two-tailed p-values (assuming here that
>> your observed theta is above the null value):
>> Method 1: find the proportion of recentered theta*'s that are above
>> your observed theta; the p-value = 2 * this proportion.
>> Method 2: find the proportion of recentered theta*'s whose absolute
>> value is above the absolute value of your observed theta. This is
>> your p-value.
>>
>> So this seems simple (a rough R sketch of what I mean is below).
>> But I can't find people discussing this, so I'm thinking I'm wrong.
>> Could someone explain where I've gone wrong?
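>>
>> In R, I imagine something like this (a toy one-sample mean example;
>> the data and names are made up):
>>
>> set.seed(1)
>> x <- rnorm(30, mean = 0.4)    # toy data
>> theta.hat <- mean(x)          # observed theta
>> B <- 10000
>> theta.star <- replicate(B, mean(sample(x, replace = TRUE)))  # 1)-2)
>> theta.null <- theta.star - mean(theta.star)   # 3) recenter at 0
>> p1 <- 2 * mean(theta.null >= theta.hat)       # 4) Method 1
>> p2 <- mean(abs(theta.null) >= abs(theta.hat)) # 4) Method 2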
>
>
> There's nothing particularly wrong about this line of reasoning, or
> at least not (much) worse than the calculation of a CI. After all,
> one definition of a CI at level 1 - alpha is that it contains the
> values of theta0 for which the hypothesis theta = theta0 is accepted
> at level alpha. (Not the only possible definition, though.)
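>
> For instance, a two-sided percentile-type p-value can be read off by
> inverting the interval (a sketch, not a recommendation; theta.star
> is the vector of bootstrap replicates):
>
> p.from.ci <- function(theta.star, theta0) {
>     F0 <- mean(theta.star <= theta0)  # bootstrap ECDF at theta0
>     2 * min(F0, 1 - F0)   # smallest alpha at which theta0 is excluded
> }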
>
> The crucial bit in both cases is the assumption of approximate translation
> invariance, which holds asymptotically, but maybe not well enough in small
> samples.
>
> There are some brain-twisters connected with the bootstrap; e.g., if
> the bootstrap distribution is skewed to the right, should the CI be
> skewed to the right or to the left? The answer is that it cannot be
> decided from the distribution of theta* alone, since that
> distribution depends only on the true theta, and we need to know
> what the distribution would have been had a different theta been the
> true one.
>
> The point is that these things get tricky, so most people head for
> the safe haven of permutation testing, where it is rather easier to
> feel that you know what you are doing.
>
> For a rather different approach, you might want to look into the theory of
> empirical likelihood (book by Art Owen, or just Google it).
>
> --
>   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
>



-- 
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com



