[R] chisq test and fisher exact test

Kjetil Brinchmann Halvorsen kjetil at acelerate.com
Wed Jun 22 20:50:00 CEST 2005


Weiwei Shi wrote:

>Hi,
>I have a text mining project and currently I am working on the feature
>generation/selection part.
>My plan is to select a set of words or word combinations which have
>better discriminant capability than other words in telling the group
>ids (2 classes in this case) for a dataset which has 2,000,000
>documents.
>
>One approach is using "contrast-set association rule mining", while
>the other is using a chi-squared or Fisher exact test.
>
>An example with 3 contingency tables for 3 words follows
>(words coded by number):
>
>> tab[,,1:3]
>, , 1
>
>      [,1]    [,2]
>[1,] 11266 2151526
>[2,]   125   31734
>
>, , 2
>
>      [,1]    [,2]
>[1,] 43571 2119221
>[2,]    52   31807
>
>, , 3
>
>     [,1]    [,2]
>[1,]  427 2162365
>[2,]    5   31854
>
>
>I have some questions on this:
>1. What's the rule of thumb for using the chi-squared test instead of
>the Fisher exact test? I have a vague memory that the count in each
>cell needs to be over 50 if the chi-squared test is going to be used
>instead of the Fisher exact test. In the case of word 3, I think I
>should use the Fisher test. However, running chisq.test as below is
>fine:
>
>> tab[,,3]
>     [,1]    [,2]
>[1,]  427 2162365
>[2,]    5   31854
>
>> chisq.test(tab[,,3])
>
>        Pearson's Chi-squared test with Yates' continuity correction
>
>data:  tab[, , 3]
>X-squared = 0.0963, df = 1, p-value = 0.7564
>
>but running it on the whole set of words (14240 words) gives the
>following warnings:
>
>> p.chisq<-as.double(lapply(1:N, function(i) chisq.test(tab[,,i])$p.value))
>There were 50 or more warnings (use warnings() to see the first 50)
>
>> warnings()
>Warning messages:
>1: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
>2: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
>3: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
>4: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
>
>
>2. So, my second question is: is this warning because I am violating
>an assumption of the chi-squared test? But then why is word 3 fine?
>How can I trace the warning to see which word caused it?
>
>3. My result looks like this (after mapping the number ids back to
>words; some words are stemmed here, e.g. ACCID is accident):
> > of[1:50,]
>      map...2.      p.fisher
>21       ACCID  0.000000e+00
>30          CD  0.000000e+00
>67        ROCK  0.000000e+00
>104      CRACK  0.000000e+00
>111       CHIP  0.000000e+00
>179      GLASS  0.000000e+00
>84        BACK 4.199878e-291
>395   DRIVEABL 5.335989e-287
>60         CAP 9.405235e-285
>262 WINDSHIELD 2.691641e-254
>13          IV 3.905186e-245
>110         HZ 2.819713e-210
>11        CAMP 9.086768e-207
>2      SHATTER 5.273994e-202
>297        ALP 1.678521e-177
>162        BED 1.822031e-173
>249        BCD 1.398391e-160
>493       RACK 4.178617e-156
>59        CAUS 7.539031e-147
>
>3.1 Question: Should I use a two-sided test instead of a one-sided
>one for the Fisher test? I read some material which suggests using
>two-sided.
>
>3.2 A big question: The result looks very promising, since this is a
>case of classifying fraud cases and the words selected by this
>approach make sense. However, I think the p-values here just indicate
>how strongly the null hypothesis is rejected, not the strength of
>association between word and class of document. So, what kind of
>statistic should I use here to evaluate the strength of association?
>The odds ratio?
>
>Any suggestions are welcome!
>
>Thanks!
>  
>
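You can call chisq.test with simulate.p.value=TRUE (sim=TRUE works
through partial argument matching), which computes the p-value by
Monte Carlo simulation instead of the chi-squared approximation.
Alternatively, call it as usual first, check whether there is a
warning, and only then call it again with simulate.p.value=TRUE.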
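
A minimal sketch of the second approach, which also records which
tables raised the warning. It relies on the fact that chisq.test warns
when any expected cell count is below 5, so for word 3 (smallest
expected count about 6.3) no warning appears. The helper names
min_expected and p_with_fallback are hypothetical, and the second table
is made-up data added only to trigger the warning; the first table is
word 3 from the quoted message.

```r
# Table 1: word 3 from the quoted message.
# Table 2: hypothetical sparse counts (made up) that trigger the warning.
tab <- array(c(427, 5, 2162365, 31854,
               3, 1, 2000, 40),
             dim = c(2, 2, 2))

# Smallest expected cell count under independence.
# chisq.test() warns when any expected count is below 5.
min_expected <- function(m) min(outer(rowSums(m), colSums(m)) / sum(m))

# Call chisq.test() as usual; if it raises a warning, call it again
# with a Monte Carlo p-value. Note that any warning, not only the
# approximation warning, triggers the fallback in this sketch.
p_with_fallback <- function(m, B = 2000) {
  tryCatch(
    list(p = chisq.test(m)$p.value, simulated = FALSE),
    warning = function(w) {
      list(p = chisq.test(m, simulate.p.value = TRUE, B = B)$p.value,
           simulated = TRUE)
    }
  )
}

set.seed(1)  # simulated p-values are random; fix the seed to reproduce
res <- lapply(seq_len(dim(tab)[3]), function(i) p_with_fallback(tab[, , i]))

p.chisq   <- sapply(res, function(r) r$p)          # one p-value per word
simulated <- sapply(res, function(r) r$simulated)  # TRUE where a warning occurred
min.exp   <- apply(tab, 3, min_expected)           # smallest expected count per word

which(simulated)  # indices of the words whose tables raised the warning
```

With the full array, which(simulated) (or which(min.exp < 5)) gives the
word indices to look at. Simulating only for the tables that warn keeps
the run time down when there are thousands of words.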

Kjetil

-- 

Kjetil Halvorsen.

Peace is the most effective weapon of mass construction.
               --  Mahdi Elmandjra



