[R] Is it safe? Cochran etc

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Sun Oct 10 12:29:09 CEST 2004


On Sat, 9 Oct 2004, Kjetil Brinchmann Halvorsen wrote:

>Dan Bolser wrote:
>
>>I have the following contingency table
>>
>>dat <- matrix(c(1,506,13714,878702),nr=2)
>>
>>And I want to test if their is an association between events 
>>
>>A:{a,not(a)} and B:{b,not(b)}
>>
>>        | b   | not(b) |
>>--------+-----+--------+
>> a      |   1 |  13714 |
>>--------+-----+--------+
>> not(a) | 506 | 878702 |
>>--------+-----+--------+
>>
>>I am worried that prop.test and chisq.test are not valid given the low
>>counts and low probabilites associated with 'sucess' in each category.
>>  
>>
>
> > test <- matrix( c(1,506, 13714, 878702), 2,2)
> > test
>     [,1]   [,2]
>[1,]    1  13714
>[2,]  506 878702
> > chisq.test(test)
>
>        Pearson's Chi-squared test with Yates' continuity correction
>
>data:  test
>X-squared = 5.1584, df = 1, p-value = 0.02313
>
> > chisq.test(test,sim=TRUE)
>
>        Pearson's Chi-squared test with simulated p-value (based on 2000
>        replicates)
>
>data:  test
>X-squared = 6.0115, df = NA, p-value = 0.02099
>
> > chisq.test(test,sim=TRUE, B=200000)
>     # To rule out simulation uncertainty
>        Pearson's Chi-squared test with simulated p-value (based on 200000
>        replicates)

What is being simulated? Sorry for my ignorance. Is it the same as below?


>data:  test
>X-squared = 6.0115, df = NA, p-value = 0.01634
>
> > N <- sum(test)
> > rows <- rowSums(test)
> > cols <- colSums(test)
> > E <- rows %o% cols/N
> > E
>           [,1]      [,2]
>[1,]   7.787351  13707.21
>[2,] 499.212649 878708.79
> >
>
>None of the expected'eds are lesser than 5, an often  used rule of thumb 
>(which might even be
>to conservative. Just to check on the distribution:

OK, we are 'allowed' to use an approximation, but what am I approximating,
and how does this relate to fishers exact test for 2x2 table?


>rows <- round(rows)
>cols <- round(cols)

The above dosn't do anything in this example, did you mean to do 

rows <- rowSums(E)
cols <- colSums(E)

first?

>
> > pvals <- sapply(r2dtable(100000, rows, cols), function(x) 
>chisq.test(x)$p.value)
> > hist(pvals)
>     # not very good approximation to uniform histogram, but:
> > sum(pvals < 0.05)/100000
>[1] 0.03068
> > sum(pvals < 0.01)/100000
>[1] 0.00669

Above you simulate fishers exact test by making permutations on the given
table and measuring the probability of the resulting table with the same
marginals. You then count how much of this probability is more extreem
than standard critical values to see how similar to the chisq distribution
this simultion data is? Why should the probabilites above be 'uniform' as
you state?

>So the true levels are not very far from the calculated by te chisq 
>approximation, and it seems safe to use it.
>
>All of this to show that with R you are not anymore dependent on old 
>"rules os thumb",
>you can investigate for yourself.

Thanks very much for your explainations and guedeance, but it is my
exploration that is driving me mad :( I feel like I am running in circles.

Can you explain this for example...

dat <- matrix(c(500,560,1000,1000),nr=2)


prop.test(dat)$p.value
[1] 0.1464539
> chisq.test(dat)$p.value
[1] 0.1464539
> fisher.test(dat)$p.value
[1] 0.1385046

Is the difference of 0.0079493 OK

Now...

> dat <- matrix(c(5,7,10,10),nr=2)
> prop.test(dat)$p.value
[1] 0.9271224
> chisq.test(dat)$p.value
[1] 0.9271224
> fisher.test(dat)$p.value
[1] 0.7256694

The expecteds for the above are all above 5, but we get a difference of
0.201453

You may say (just use fisher) above, but then i am back where I started,
when should I use / not use fisher.

Why can't I use an exact multinomial distribution? (how?)

sorry for my total confusion,
thanks again,
Dan.


>Kjetil
>
>>Is it safe to use them, and what is the alternative? (given that
>>fisher.test can't handle this data... hold the phone...
>>
>>I just found fisher.test can handle this data if the test is one-tailed
>>and not two-tailed.
>>
>>I don't understand the difference between chisq.test, prop.test and
>>fisher.test when the hybrid=1 option is used for the fisher.test.
>>
>>I was using the binomial distribution to test the 'extremity' of the
>>observed data, but now I think I know why that is inapropriate, however,
>>with the binomial (and its approximation) at least I know what I am
>>doing. And I can do it in perl easily...
>>
>>Generally, how should I calculate fisher.test in perl (i.e. what are its
>>principles). When is it safe to approximate fisher to chisq?
>>
>>I cannot get insight into this problem...
>>
>>How come if I do...
>>
>>dat <- matrix(c(50,60,100,100),nr=2)
>>
>>prop.test(dat)$p.value
>>chisq.test(dat)$p.value
>>fisher.test(dat)$p.value
>>
>>I get 
>>
>>[1] 0.5173269
>>[1] 0.5173269
>>[1] 0.4771358
>>
>>When I looked at the binomial distribution and the normal approximation
>>thereof with similar counts I never had a p-value difference > 0.004
>>
>>I am so fed up with this problem :(
>>
>>______________________________________________
>>R-help at stat.math.ethz.ch mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>>
>>
>>  
>>
>
>
>




More information about the R-help mailing list