[R] Permutation or Bootstrap to obtain p-value for one sample

peter dalgaard pdalgd at gmail.com
Sun Oct 9 14:53:53 CEST 2011


On Oct 9, 2011, at 12:00 , francesca casalino wrote:

> Thank you very much to both Ken and Peter for the very helpful explanations. 
> 
> Just to understand this better (sorry for repeating but I am also new in statistics…so please correct me where I am wrong):
> 
> Ken' method:
> Random sampling of the mean, and then using these means to construct a distribution of means (the 'null' distribution), and I can then use this normal distribution and compare the population mean to my mean using, for example, z-score.
> Of note: The initial distributions are not normal, so I thought I needed to base my calculations on the median, but I can use the mean to construct a normal distribution. 
> This would be defined a bootstrap test.  
> 
> Peter's method:
> Random sampling of the mean, and then comparing each sampled mean with the population mean and see if it is higher or equal to the difference between my sample and the population mean. This is a permutation test, but to actually get CI and a p-value I would need bootstrap method. 
> 
> Did I understand this correctly? 
> 
> I tried to start with Ken's approach for now, and followed his steps, but: 
> 
> 1) I get a lot of NaN in the sampling distribution, is this normal? 
> 2) I think I am doing again something wrong when I try to find a p-value…This is what I did:
> 
> nreps=10000 
> mean.dist=rep(NA,nreps) 
> 
> for(replication in 1:nreps) 
> { 
> my.sample=sample(population$Y, 250, replace=F) 
> #Peter mentioned that this sampling should be without replacement, so I went for that---
> 
> mean.for.rep=mean(my.sample) #mean for this replication 
> mean.dist[replication]=mean.for.rep #store the mean 
> } 
> 
> hist(mean.dist,main="Null Dist of Means", col="chartreuse") 
>  #Show the means in a nifty color 
> 
> mean_dist= mean(mean.dist, na.rm=TRUE)
> sd_pop= sd(mean.dist, na.rm=TRUE)
> 
> mean_sample= mean(population$Y, na.rm=TRUE)
> 

You mean sample$Y, I hope.


> z_stat= (mean_sample - mean_dist)/(sd_pop/sqrt(2089))
> p_value= 2*pnorm(-abs(z_stat))
> 
> Is this correct? 

No. First, notice that replicate() does all the "red tape" of the for loop for you. It's not a grave error to do what you're doing, but I did give you code, so maybe you should try it...

Worse: The whole point of doing a permutation test is to compare the observed value directly to the actual permutation distribution. So you take your observed mean and see whether it falls in the middle or in the tails of the histogram. The p value can be estimated as the relative frequency of simulated means falling as far or further out in the tails as the observed one. 

You do not want to approximate the simulated distribution with a normal distribution. If that's what you wanted, you could do it with no simulation at all -- there are simple formulas relating the mean and variance of the sample mean to the mean and variance of the population, and you'd more or less wind up reinventing the two-sample t-test.


> THANK YOU SO MUCH FOR ALL YOUR HELP!! 
> 
> 2011/10/9 Ken Hutchison <vicvoncastle at gmail.com>
> Hi Francy, 
>   A bootstrap test would likely be sufficient for this problem, but a one-sample t-test isn't advisable or necessary in my opinion. If you use a t-test multiple times you are making assumptions about the distribution of your data; more importantly, your probability of Type 1 error will be increased with each test. So, a valid thing to do would be to sample (computation for this problem won't be expensive so do alotta reps) and compare your mean to the null distribution of means. I.E.
> 
> nreps=10000
> mean.dist=rep(NA,nreps)
> 
> for(replication in 1:nreps)
> {
> my.sample=sample(population, 2500, replace=T) 
> #replace could be false, depends on preference
> mean.for.rep=mean(my.sample) #mean for this replication
> mean.dist[replication]=mean.for.rep #store the mean
> }
> 
> hist(mean.dist,main="Null Dist of Means", col="chartreuse") 
>  #Show the means in a nifty color
> 
> You can then perform various tests given the null distribution, or infer from where your sample mean lies within the distribution or something to that effect. Also, if the distribution is normal, which is somewhat likely since it is a distribution of means: (shapiro.test or require(nortest) ad.test will let you know) you should be able to make inference from that using parametric methods (once) which will fit the truth a bit better than a t.test.
>         Hope that's helpful, 
>            Ken Hutchison
> 
> 
> On Sat, Oct 8, 2011 at 10:04 AM, francy <francy.casalino at gmail.com> wrote:
> Hi,
> 
> I am having trouble understanding how to approach a simulation:
> 
> I have a sample of n=250 from a population of N=2,000 individuals, and I
> would like to use either permutation test or bootstrap to test whether this
> particular sample is significantly different from the values of any other
> random samples of the same population. I thought I needed to take random
> samples (but I am not sure how many simulations I need to do) of n=250 from
> the N=2,000 population and maybe do a one-sample t-test to compare the mean
> score of all the simulated samples, + the one sample I am trying to prove
> that is different from any others, to the mean value of the population. But
> I don't know:
> (1) whether this one-sample t-test would be the right way to do it, and how
> to go about doing this in R
> (2) whether a permutation test or bootstrap methods are more appropriate
> 
> This is the data frame that I have, which is to be sampled:
> df<-
> i.e.
> x y
> 1 2
> 3 4
> 5 6
> 7 8
> . .
> . .
> . .
> 2,000
> 
> I have this sample from df, and would like to test whether it is has extreme
> values of y.
> sample1<-
> i.e.
> x y
> 3 4
> 7 8
> . .
> . .
> . .
> 250
> 
> For now I only have this:
> 
> R=999 #Number of simulations, but I don't know how many...
> t.values =numeric(R)     #creates a numeric vector with 999 elements, which
> will hold the results of each simulation.
> for (i in 1:R) {
> sample1 <- df[sample(nrow(df), 250, replace=TRUE),]
> 
> But I don't know how to continue the loop: do I calculate the mean for each
> simulation and compare it to the population mean?
> Any help you could give me would be very appreciated,
> Thank you.
> 
> 
> --
> View this message in context: http://r.789695.n4.nabble.com/Permutation-or-Bootstrap-to-obtain-p-value-for-one-sample-tp3885118p3885118.html
> Sent from the R help mailing list archive at Nabble.com.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com



More information about the R-help mailing list