[R] R code and simulation suggestions

Fri Oct 19 23:48:09 CEST 2001

Hi,

I have a question about a simulation I've tried to write in R.  I'd
appreciate any suggestions for improvements in my R code or logic used to
construct the simulation.

Background:  I've recently come across a paper that cautions against using
the normal approximation to the binomial when P (population proportion)
is very small (or very large).  As an applied data analyst who is about to
do some work with differences between small p-hats (sample proportions), I
wanted to demonstrate to myself that even though the parent distributions
may not be well approximated by the normal distribution, the differences
between two proportions may be.  There may be an elegant proof of this (an
extension of the Central Limit Theorem?), but I wanted to see if I could
both demonstrate it the concept to myself and to sharpen my R usage to do
it.

Here's my strategy.  Assume two populations with population proportions of
.05.  Under the null hypothesis of no difference, the population
difference between proportions is 0.  Sample p-hats from these two
proportions repeatedly to create an empirical sampling distribution for
p-hat differences.  Bootstrap sample from the empirical sampling
distribution.  Find the mean p-hat difference for each bootstrap sample.
Then plot the resulting mean differences with a histogram, density plot,
and qq plot for visual inspection.  Test for normality just for kicks.

Here's my code:

### begin code

# an attempt to sample from two binomial distributions both
# with P=.05 to create a sampling distribution of p-hat1 - p-hat2
# differences

# create the binomial variable x and p(x)
source<-seq(0, 100, by=1)
px<-dbinom(source, 100, .05)

# sample from binomial distributions with probability px
# divide by 100 to create p-hats
# find 500 differences from randomly drawn p-hats with probability px
#  to create empircal sampling distribution for p-hat differences
dif4<-((sample(source, 500, replace=TRUE, px))/100 - (sample(source, 500,
        replace=TRUE, px))/100)

# get 5000 bootstrap samples from empirical sampling distribution for
p-hat differences
library(bootstrap)
sim4<-bootstrap(dif4, 5000, mean)

# pull out the p-hat difference statistics to plot
par(mfrow=c(2,2))
hist(sim4$thetastar, main="Histogram Prop<=.05, binomial draws")
plot(density(sim4$thetastar), main="Density Plot Prop<=.05, binomial
     draws")

# a significance test for normality
shapiro.test(sim4$thetastar)
qqnorm(sim4$thetastar, main="QQ Plot Prop<=.05, binomial draws")
plot(mean(sim4$thetastar), type="n", main="Density Plot Prop<=.05,
binomial draws")

### end code

This is my first attempt at something like this, so I'd consider any
suggestions as learning opportunities.  Does this make sense?  Does it
seem to do what I intend?  Are there faster/cleaner ways to do this?

I'm not sure it's relevant for this question but just in case:

platform i386-pc-mingw32
arch     x86
os       Win32
system   x86, Win32
status
major    1
minor    3.1
year     2001
month    08
day      31
language R

Thanks,

Scot McNary

--
  Scot W. McNary  email:smcnary at charm.net

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._