[R] plot - central limit theorem

Tue Oct 21 17:36:01 CEST 2008

I was just making a suggestion; feel free to act on it or ignore it at will.  My problem with the p-values is that as the sampling distribution becomes more normal with increasing sample size, the p-values do not converge to a value but to a distribution (the uniform).  I don't expect that people who need a demonstration of the CLT, and who need extra help telling whether the bell-shaped histogram is actually normal yet, will be able to interpret the lower plot correctly as it converges to uniform random noise.  Instead of the p-value, you could use some other measure of the difference between the observed empirical distribution of the means and the theoretical normal; the test statistic from the normality test may work for that.  My other suggestions were based on the tests/intervals usually done based on approximate normality.  All of those should converge towards a fixed value (with a little randomness from the sampling) rather than a distribution.
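
A rough sketch of that contrast, assuming exponential data and the Shapiro-Wilk test as the normality test (both chosen here only for illustration; the helper name sw.means is made up and is not part of any package).  For each sample size it simulates a batch of sample means and reports the test statistic W next to the p-value, so the two can be compared as n grows:

```r
# Illustration only: sw.means() is a hypothetical helper, not from any package.
# It simulates `nmeans` sample means, each from a size-n exponential sample,
# and returns the Shapiro-Wilk statistic W and its p-value for those means.
sw.means <- function(n, nmeans = 500) {
  means <- replicate(nmeans, mean(rexp(n)))
  st <- shapiro.test(means)
  c(W = unname(st$statistic), p.value = st$p.value)
}

set.seed(2008)
sizes <- c(1, 5, 25, 100, 400)
res <- t(sapply(sizes, sw.means))   # one row per sample size
rownames(res) <- paste0("n=", sizes)
print(round(res, 4))
```

W is bounded above by 1 and should move towards 1 as the means look more normal, which makes it easier to read as a trend than the p-value; other distance measures would serve the same purpose.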

I think the measure of how good the normal approximation is should be whether the testing/estimation based on it behaves close to how it would under a normal distribution, i.e. do tests based on the normal assumption have a type I error rate close to alpha, and do confidence intervals based on the normal assumption contain the true value about (1-alpha)*100% of the time?  If the answers to those are yes, does it really matter what the shape of the distribution is?
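
As a minimal sketch of that kind of check (not code from any package; the function name coverage, the exponential population, and the simulation sizes are all assumptions made for illustration), the following estimates how often a normal-based interval for the mean contains the true mean.  One minus the coverage is the type I error rate of the matching two-sided test of the true mean:

```r
# Illustration only: coverage() is a hypothetical helper.
# For samples of size n from `rdist` (default: exponential with mean 1), it
# estimates the proportion of normal-based (1 - alpha) intervals
#   mean(x) +/- z * sd(x) / sqrt(n)
# that contain the true mean.
coverage <- function(n, nsim = 2000, alpha = 0.05,
                     rdist = rexp, true.mean = 1) {
  z <- qnorm(1 - alpha / 2)
  hits <- replicate(nsim, {
    x <- rdist(n)
    half.width <- z * sd(x) / sqrt(n)
    abs(mean(x) - true.mean) <= half.width
  })
  mean(hits)
}

set.seed(1)
sizes <- c(2, 5, 10, 30, 100)
cover <- sapply(sizes, coverage)
names(cover) <- paste0("n=", sizes)
print(round(cover, 3))
```

Plotting cover against the sample size gives a single curve that should approach 1 - alpha, up to simulation noise, instead of a cloud of p-values.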

As for my reference normal in clt.examp(), I can see arguments for using either the observed values or the theoretical values, but as you mentioned, it is just a quick visual comparison, not an actual test, so I doubt that it would make much difference, especially with large numbers of samples.  I may add an option in the future to choose which one to use, as well as some indication of type I error rate and CI coverage based on the normal approximation.  I should also include an option to fix the x-axes, so that the user can see the spread decreasing with increased sample size.
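
For a sense of how small the difference is, here is a quick sketch using the uniform(a, b) case from the quoted message below, where the mean of n draws has mean (a+b)/2 and sd (b-a)/sqrt(12*n).  The values of a, b, n and the number of simulated means are arbitrary choices for illustration:

```r
# Compare the two candidate parameter sets for the reference normal curve:
# theoretical values versus values estimated from the simulated means.
set.seed(3)
a <- 0; b <- 1; n <- 10
means <- replicate(5000, mean(runif(n, a, b)))

theo <- c(mean = (a + b) / 2, sd = (b - a) / sqrt(12 * n))
obs  <- c(mean = mean(means), sd = sd(means))
print(round(rbind(theoretical = theo, observed = obs), 4))
```

With only a handful of simulated means the two rows can differ noticeably, which is where an option to choose between them would matter.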

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: Yihui Xie [mailto:xieyihui at gmail.com]
> Sent: Sunday, October 19, 2008 6:33 AM
> To: Greg Snow
> Cc: roger koenker; r-help
> Subject: Re: [R] plot - central limit theorem
>
> I don't know whether showing p-values is the best approach either, but
> I'm using them only as indicators to show how good the approximation
> would be as the sample size increases. You may regard the p-values as
> a measure of goodness of fit. I don't think I need to answer the
> question of hypothesis test -- as Duncan has explained.
>
> Yes, you can generate normal random numbers at the same time and
> compare the p-values, but I prefer comparing the sample means with the
> theoretical population distribution instead of simulated normal random
> numbers.
>
> The problem with most demos of the CLT is that we have no means of
> observing how good the approximation is. In your clt.examp(), there is
> a graphical measure, i.e. comparing the density curve to the histogram,
> but that's not sufficient, as sometimes our eyes cannot easily detect
> differences between curves, e.g. the t-distribution and the normal
> distribution. That's why I use numerical measures like p-values.
>
> P.S. I think your code in clt.examp() needs a correction: the
> parameters of the theoretical normal distribution should not be
> computed from *simulated* means & variances, but from the original
> theoretical distribution. For example, for the mean of n draws from
> the uniform distribution over (a, b), mean = (a+b)/2 and
> sd = (b-a)/sqrt(12*n) (although in the case of large sample sizes
> these results will be very close).
>
> Regards,
> Yihui
> --
> Yihui Xie <xieyihui at gmail.com>
> Phone: +86-(0)10-82509086 Fax: +86-(0)10-82509086
> Mobile: +86-15810805877
> Homepage: http://www.yihui.name
> School of Statistics, Room 1037, Mingde Main Building,
> Renmin University of China, Beijing, 100872, China
>
>
>
> On Thu, Oct 16, 2008 at 11:43 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> > I wonder if including the p-values for the normality test is the best
> > approach in your animation?  The CLT does not say that the distribution
> > of the means will be normal, just that it approaches normality (and
> > therefore may be a decent approximation).  The normality test can only
> > reject the null that the data (simulated means) come from a normal
> > distribution.  Since the true distribution of the means is not normal
> > (unless you use a sample size of Inf, and I for one have better things
> > to do than wait for a computer to simulate several samples of size
> > Inf), the null for the normality test is always false, and therefore
> > the test will always result in either saying it is not normal or a
> > type II error.  The real goal is not to show normality, but to show
> > that using the normal gives a "good enough" approximation.  I would
> > prefer the bottom plot to show either the proportion of p-values from
> > a normal-based test on the simulated data that are less than alpha, or
> > the proportion of confidence intervals based on the normal-based test
> > that include the true parameter.  Then the user can see when those
> > values become a close enough approximation.
> >
> > What is your target audience for this demo?  In my opinion, anyone
> > who could understand the bottom plot should already understand the CLT
> > well enough not to need the demo; those that I would aim the demo at
> > would just be confused by the current bottom plot.