[R] On Corrections for Chi-Sq Goodness of Fit Test

Fri Dec 23 04:56:07 CET 2011

On 20/12/11 10:24, Michael Fuller wrote:
> TOPIC
> My question regards the philosophy behind how R implements corrections to chi-square statistical tests. At least in recent versions (I'm using 2.13.1 (2011-07-08) on OSX 10.6.8.), the chisq.test function applies the Yates continuity correction for 2 by 2 contingency tables. But when used as a goodness of fit test (GoF, aka likelihood ratio test), chisq.test does not appear to implement any corrections for widely recognized problems, such as small sample size, non-uniform expected frequencies, and one D.F.
>
> From the help page:
> "In the goodness-of-fit case simulation is done by random sampling from the discrete distribution specified by p, each sample being of size n = sum(x)."
>
> Is the thinking that random sampling completely obviates the need for corrections?
     Yes.
> Wouldn't the same statistical issues still apply
     No.
> (e.g. poor continuity approximation with one D.F.,
     There are no degrees of freedom involved.  There is no continuity
     involved.
     The observed test statistic (say "Stat") is compared with a number of
     test statistics, Stat_1, ..., Stat_N, calculated from data sets
     simulated under the null hypothesis.  If the null is true, then Stat
     and Stat_1, ..., Stat_N are all of ``equal status'', i.e. Stat is
     equally likely to occupy any position in the ordering of the N+1
     values.  If there are m values of the Stat_i which are greater than
     Stat, then the ``probability of observing, under the null hypothesis,
     data as extreme as, or more extreme than, what you actually observed''
     is the probability of randomly selecting one of a specified set of
     m+1 ``slots'' out of a total of N+1 slots (where each slot has
     probability 1/(N+1)).

     Thus the p-value is (exactly) equal to (m+1)/(N+1).
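
     The calculation described above can be sketched in a few lines of R.
     This is only an illustration: the counts x, the null probabilities p,
     the number of simulations N and the helper chisq_stat() are made-up
     names and values, not anything taken from the original question.

```r
# Minimal sketch of a Monte Carlo goodness-of-fit p-value.
# x, p and N are hypothetical values chosen for illustration.
set.seed(42)
x <- c(12, 5, 3)          # observed counts
p <- c(0.5, 0.3, 0.2)     # cell probabilities under the null
N <- 1999                 # number of simulated data sets
n <- sum(x)               # each simulated sample has size n = sum(x)
E <- n * p                # expected counts under the null

# Pearson chi-squared statistic for a vector of counts
chisq_stat <- function(counts) sum((counts - E)^2 / E)

Stat   <- chisq_stat(x)                       # observed statistic
sims   <- rmultinom(N, size = n, prob = p)    # one simulated table per column
Stat_i <- apply(sims, 2, chisq_stat)          # Stat_1, ..., Stat_N

m       <- sum(Stat_i > Stat)   # simulated values strictly greater than Stat
p_value <- (m + 1) / (N + 1)
p_value
```

     chisq.test(x, p = p, simulate.p.value = TRUE, B = N) performs the same
     kind of simulation.  Its answer will differ a little because of the
     random draws and, it appears, because it counts simulated values that
     are greater than or equal to the observed one.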

     The only restriction is that there be no ties amongst the values of
     Stat and Stat_1, ..., Stat_N.  There being ties is of fairly low
     probability, but is not of zero probability, since there is a finite
     number of possible samples and hence of statistic values.  So this
     restriction is a mild worry.
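
     How much ties matter for a particular table can be checked directly by
     counting them.  The sketch below uses a deliberately small, made-up
     two-cell table (x, p, N and stat() are hypothetical) and compares the
     p-value with ties excluded against the p-value with ties counted as
     "as extreme as".

```r
# Count ties between the observed and simulated statistics and see how
# they move the Monte Carlo p-value.  All values here are hypothetical.
set.seed(1)
x <- c(7, 3)              # observed counts
p <- c(0.5, 0.5)          # cell probabilities under the null
N <- 999                  # number of simulated data sets
n <- sum(x)
E <- n * p                # expected counts under the null

# Pearson chi-squared statistic for a vector of counts
stat <- function(counts) sum((counts - E)^2 / E)

Stat   <- stat(x)
Stat_i <- apply(rmultinom(N, size = n, prob = p), 2, stat)

n_ties   <- sum(Stat_i == Stat)                  # simulated values tied with Stat
p_strict <- (sum(Stat_i >  Stat) + 1) / (N + 1)  # ties not counted
p_ties   <- (sum(Stat_i >= Stat) + 1) / (N + 1)  # ties counted as extreme
c(ties = n_ties, strict = p_strict, with_ties = p_ties)
```

     The two p-values differ by exactly n_ties/(N+1), so the gap shrinks as
     the number of distinct possible statistic values grows, which in
     practice means larger samples or more cells.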

     However, a ``continuity correction'' would be of no help whatsoever.
> problems with non-uniform expected frequencies, etc) with random sampling?

     Don't understand what you mean by this.

         cheers,

             Rolf Turner