[R] 2x2 test: total confusion.

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Wed Oct 6 17:30:32 CEST 2004


I want a test for the 'association' between two events, let's say the
color of balls picked and the pickers (this is quite a good analogy to my
data).

I have       200 different pickers P
I have     1,000 colors of balls   C

I have 1,000,000 picks in total


I am totally confused about what test to apply and when and why. 


This is what I *think*

I know how many balls each picker picked      - so that marginal is fixed.
I know how many balls of each color there are - so that marginal is fixed.
I know the total picks.

I can test the 'association' between Picker p and color c by doing the
following...

prob_of_pick(p)  = picks made by p  / total picks
prob_of_color(c) = balls of color c / total picks 

prob_of_success = prob_of_pick_of_color(pc) =
  picks made by p  / total picks *
  balls of color c / total picks
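
As a worked example of the arithmetic above, this R sketch plugs in one row of the sample data shown further down (color 46626, picker rs). The variable names are illustrative only and are not part of the data file.

```r
# Values taken from the sample row: color 46626, picker rs
picked      <- 32      # PICKED: balls of this color picked by this picker
c_total     <- 1532    # C_TOTAL: balls of this color
p_total     <- 3285    # P_TOTAL: picks made by this picker
grand_total <- 878702  # GRAND_TOTAL: all picks

prob_of_pick  <- p_total / grand_total
prob_of_color <- c_total / grand_total

# Probability of a (picker, color) pick if the two were independent
prob_of_success <- prob_of_pick * prob_of_color

# Expected count under independence, to compare with the observed count
expected <- grand_total * prob_of_success
```

For this row the expected count comes out at about 5.7, against 32 observed.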


USE BINOMIAL DISTRIBUTION

  n = total picks
  k = number of balls of color c picked by picker p
  p = prob_of_pick_of_color(pc)


Significance of this particular observation = 

sig <- 0
if (k < n*p) {
  # fewer than expected: lower tail, P(X <= k)
  sig <- sum(dbinom(0:k, n, p))
} else {
  # at least as many as expected: upper tail, P(X >= k)
  sig <- sum(dbinom(k:n, n, p))
}
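
The sums of binomial point probabilities above can also be obtained from pbinom, the binomial distribution function in base R, which avoids building a long vector when n is large. A minimal sketch follows; the helper name binom_tail is hypothetical.

```r
# Hypothetical helper: one-sided binomial tail in the direction of the
# observed deviation from the expected count n*p.
binom_tail <- function(k, n, p) {
  if (k < n * p) {
    pbinom(k, n, p)                          # P(X <= k)
  } else {
    pbinom(k - 1, n, p, lower.tail = FALSE)  # P(X >= k)
  }
}

# Check against the explicit sums on a small case (n*p = 6)
n <- 20; p <- 0.3
lower_sum <- sum(dbinom(0:2, n, p))   # k = 2, below expectation
upper_sum <- sum(dbinom(10:n, n, p))  # k = 10, above expectation
```

Note the k - 1 in the upper tail: pbinom(..., lower.tail = FALSE) gives P(X > x), so x has to be k - 1 to include k itself.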

When np and np(1-p) are both greater than 10, I use the normal
approximation to the binomial distribution, with mean np and variance
np(1-p) and a continuity correction (+/-0.5 depending on the direction of
the test).
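
The normal approximation described above might be sketched in R as below. The helper name normal_tail and the comparison values (n = 1000, p = 0.5, k = 530) are made up for illustration and do not come from the data.

```r
# Hypothetical helper: normal approximation to the one-sided binomial tail,
# with a continuity correction of 0.5 in the direction of the test.
normal_tail <- function(k, n, p) {
  mu    <- n * p
  sigma <- sqrt(n * p * (1 - p))
  if (k < mu) {
    pnorm(k + 0.5, mean = mu, sd = sigma)                      # approx P(X <= k)
  } else {
    pnorm(k - 0.5, mean = mu, sd = sigma, lower.tail = FALSE)  # approx P(X >= k)
  }
}

# Compare with the exact tail where np and np(1-p) both exceed 10
n <- 1000; p <- 0.5; k <- 530
approx_p <- normal_tail(k, n, p)
exact_p  <- pbinom(k - 1, n, p, lower.tail = FALSE)
```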

Should I use Fisher's exact test? What do I do when the numbers are very
large?
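
For reference, one way Fisher's exact test could be set up for a single (picker, color) pair is to collapse the data into a 2x2 table: this picker against all other pickers, this color against all other colors. The sketch below uses fisher.test from base R on one row of the sample data (color 46626, picker rs). It illustrates the mechanics only and is not a claim that this is the right test here. With both margins fixed the cell count is hypergeometric, so phyper gives the same one-sided tail directly.

```r
# One sample row: color 46626, picker rs
picked <- 32; c_total <- 1532; p_total <- 3285; grand_total <- 878702

# 2x2 table: rows    = this picker / other pickers,
#            columns = this color  / other colors
tab <- matrix(c(picked,           p_total - picked,
                c_total - picked, grand_total - p_total - c_total + picked),
              nrow = 2, byrow = TRUE)

# One-sided test for positive association (more picks than expected)
ft <- fisher.test(tab, alternative = "greater")

# Same tail from the hypergeometric distribution: P(X >= picked) when
# drawing p_total balls from an urn holding c_total balls of this color
# and grand_total - c_total balls of other colors
p_hyper <- phyper(picked - 1, c_total, grand_total - c_total, p_total,
                  lower.tail = FALSE)
```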

Here is a sample of my data...

COLOR   PICKER  PICKED  C_TOTAL P_TOTAL GRAND_TOTAL
46458   rs      2       706     3285    878702
46548   rs      6       725     3285    878702
46557   rs      2       180     3285    878702
46561   rs      1       243     3285    878702
46565   rs      2       1864    3285    878702
46579   rs      1       1263    3285    878702
46589   rs      3       1168    3285    878702
46600   rs      2       301     3285    878702
46604   rs      1       105     3285    878702
46609   rs      1       302     3285    878702
46626   rs      32      1532    3285    878702
...
89095   ho      1       265     1369    878702
89124   ho      1       176     1369    878702
89360   ho      2       290     1369    878702
89392   ho      1       146     1369    878702
89447   ho      1       114     1369    878702
89550   ho      1       413     1369    878702
89919   ho      1       174     1369    878702
90002   ho      2       183     1369    878702
90096   ho      1       154     1369    878702
90123   ho      4       2130    1369    878702


How can I simply add an extra column to this data that gives me a measure
of the significance of 'association' (positive or negative) between Picker
and color?
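
A sketch of one way such a column could be added in R, assuming the data sit in a data frame with the column names from the sample. The new columns EXPECTED, DIRECTION and P_VALUE are names invented for this example. The tail used here is hypergeometric (the one-sided Fisher tail) rather than binomial; pbinom could be substituted to follow the binomial approach instead.

```r
# A few rows of the sample data; the full file could be read the same way
# with read.table(<file>, header = TRUE)
d <- read.table(header = TRUE, text = "
COLOR PICKER PICKED C_TOTAL P_TOTAL GRAND_TOTAL
46548 rs      6      725    3285    878702
46565 rs      2     1864    3285    878702
46626 rs     32     1532    3285    878702
90123 ho      4     2130    1369    878702
")

# Expected count under independence
d$EXPECTED <- d$C_TOTAL * d$P_TOTAL / d$GRAND_TOTAL

# One-sided hypergeometric tails, computed for every row at once
lower <- phyper(d$PICKED,     d$C_TOTAL, d$GRAND_TOTAL - d$C_TOTAL, d$P_TOTAL)
upper <- phyper(d$PICKED - 1, d$C_TOTAL, d$GRAND_TOTAL - d$C_TOTAL, d$P_TOTAL,
                lower.tail = FALSE)

# Report the tail in the direction of the deviation from expectation
d$DIRECTION <- ifelse(d$PICKED < d$EXPECTED, "negative", "positive")
d$P_VALUE   <- ifelse(d$PICKED < d$EXPECTED, lower, upper)
```

Because phyper is vectorized, the same lines work unchanged on the full table.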

I am totally confused!

Sorry for the length of the email... Dan.



