[R] significance in difference of proportions: What problema

Mon Dec 1 10:16:55 CET 2003

> On 28-Nov-03 Torsten Hothorn wrote:
> > yes, that's my understanding too. The "enumerative techniques", as
> > you call them, condition on the data actually observed and determine
> > the null distribution of the associated test statistic from the data.
> > In contrast, unconditional procedures require some assumptions about the
> > underlying data generating process from which the null distribution is
> > derived. The appropriate choice depends on the kind of experiment
> > under test: In a randomized trial we would like to see all possible
> > outcomes of the trial caused by "rerandomization" and the enumerative
> > techniques are natural here. When we draw many samples from predefined
> > populations, men and women, say, "rerandomization" of gender is of
> > course not that easy and we may assume something about the data
> > generating process :-)
>
> Nice example, but it depends on how you look at it!
>

indeed.

> Suppose you have samples of n1 Men and n2 Women and record, for instance,
> whether or not each is suffering from a cold (r1 and r2 respectively).
> Do M & W differ in their risk of catching cold?
>
> NH: No difference; implies that the R = (r1+r2) colds have selected
> a random subset of the N=(n1+n2) individuals as victims; implies
> that the n1 Men out of N are a random subset of the R+(N-R)
> Colds/NonColds. So you then have a hypergeometric distribution and are
> back with an "exact" test. But are we "assuming something about the
> data generating process" here?
>

This is the exact conditional approach where both the row and
column marginal totals are fixed and, because in this "simple" 2x2 case
the distribution of the test statistic is known to be hypergeometric,
there is no need for explicit enumeration (and `fisher.test()' computes the
corresponding P-values) and, indeed, we do not need to make any
assumption about the data generating process.
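
As a small sketch of the point above: with both margins fixed, the two-sided
P-value can be summed directly from the hypergeometric probabilities and
compared with what `fisher.test()' reports. The counts (n1, r1, n2, r2) are
invented for illustration and are not taken from this thread.

```r
# Hypothetical counts, invented for illustration only
n1 <- 20; r1 <- 8    # men: sample size, number with a cold
n2 <- 25; r2 <- 4    # women: sample size, number with a cold
N <- n1 + n2         # all individuals
R <- r1 + r2         # all colds

tab <- matrix(c(r1, n1 - r1, r2, n2 - r2), nrow = 2, byrow = TRUE,
              dimnames = list(c("Men", "Women"), c("Cold", "NoCold")))

# With both margins fixed, the number of colds among the men is
# hypergeometric under the null hypothesis: n1 individuals drawn from
# R "cold" and N - R "no cold" individuals.
support <- max(0, n1 - (N - R)):min(n1, R)
probs <- dhyper(support, m = R, n = N - R, k = n1)

# Two-sided P-value: total probability of all tables that are no more
# likely than the observed one (small relative tolerance for ties)
p_obs <- dhyper(r1, m = R, n = N - R, k = n1)
p_manual <- sum(probs[probs <= p_obs * (1 + 1e-7)])

p_fisher <- fisher.test(tab)$p.value
all.equal(p_manual, p_fisher)
```

No enumeration of individual rearrangements is needed here; the
hypergeometric probabilities already account for all of them.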

But is fixing both the number of women and men (rows) AND the numbers of
colds (columns) intuitive when we would like to "learn" something about
the two different populations, i.e. the distribution of colds in women
and the distribution of colds in men? If we only assume the sample sizes
in both populations to be fixed, this reduces to a comparison of two
binomial parameters for two independent samples. And in this special
situation, we assume something about the data generating process: we
assume "cold" to be distributed according to a binomial law
(OK: this is not really a restrictive assumption, but for continuous
responses people tend to assume something like normality). And
the comparison of two binomial distributions leads to exact
unconditional inference: this is explained in a very nice way in Agresti
(StatMed 20, 2001, 2709-2722) and I hope I read it correctly :-)
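
A rough sketch of what such an exact unconditional test could look like, in
the spirit of Barnard's test: only the sample sizes are fixed, the common
cold probability under the null hypothesis is a nuisance parameter, and the
P-value is maximised over it. The counts, the choice of the pooled z
statistic, and the helper names zstat() and pval_at() are my own
assumptions for illustration; the grid search only approximates the supremum.

```r
# Hypothetical counts, invented for illustration only
n1 <- 20; r1 <- 8    # men: sample size, number with a cold
n2 <- 25; r2 <- 4    # women: sample size, number with a cold

# Pooled z statistic for a difference of two proportions
# (set to 0 where the pooled standard error vanishes)
zstat <- function(x1, x2, n1, n2) {
  p <- (x1 + x2) / (n1 + n2)
  se <- sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
  ifelse(se > 0, (x1 / n1 - x2 / n2) / se, 0)
}

# All possible outcomes when only n1 and n2 are fixed
grid <- expand.grid(x1 = 0:n1, x2 = 0:n2)
z_all <- abs(zstat(grid$x1, grid$x2, n1, n2))
z_obs <- abs(zstat(r1, r2, n1, n2))
extreme <- z_all >= z_obs - 1e-12   # tables at least as extreme as observed

# For a given common cold probability p, the probability of the
# extreme tables under two independent binomial samples
pval_at <- function(p) {
  sum(dbinom(grid$x1[extreme], n1, p) * dbinom(grid$x2[extreme], n2, p))
}

# Maximise over a grid of values for the nuisance parameter p
p_grid <- seq(0.001, 0.999, by = 0.001)
p_uncond <- max(sapply(p_grid, pval_at))
p_uncond
```

In contrast to the conditional approach, the reference set here contains
every table with the given n1 and n2, not only those sharing the observed
number of colds.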

> (Of course, in the background lurks the Ogre of Exchangeability,
> in that the probability of catching cold may vary from person to
> person, whether Man or Woman, but nothing in the information
> plus NH suggests any reason to distinguish any arrangement of the
> N people from any other; equivalent to a re-randomisation of
> gender ... ??).

yes, looks like that. But re-randomization of gender ( = fixed column
marginal totals) is maybe hard to sell to our customers :-)
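
To make the "re-randomisation of gender" concrete, here is a small
simulation sketch: the observed cold outcomes are held fixed, the label
"Man" is reassigned at random to n1 of the N individuals, and the resulting
one-sided tail probability is compared with the hypergeometric one. The
counts, the seed and the number of replications are arbitrary choices for
illustration.

```r
set.seed(1)

# Hypothetical counts, invented for illustration only
n1 <- 20; r1 <- 8    # men: sample size, number with a cold
n2 <- 25; r2 <- 4    # women: sample size, number with a cold
N <- n1 + n2
R <- r1 + r2

# Observed outcomes, held fixed: R colds among N individuals
cold <- rep(c(TRUE, FALSE), times = c(R, N - R))

# Re-randomise gender: choose at random which n1 individuals are "Men"
# and count the colds among them
sim <- replicate(20000, sum(cold[sample.int(N, n1)]))

# One-sided tail probability P(colds among men >= r1):
# simulated versus exact hypergeometric
p_sim <- mean(sim >= r1)
p_exact <- phyper(r1 - 1, m = R, n = N - R, k = n1, lower.tail = FALSE)
c(simulated = p_sim, exact = p_exact)
```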

Best,

Torsten

>
> Best wishes,
> Ted.