[R] chisq test and fisher exact test

Weiwei Shi helprhelp at gmail.com
Thu Jun 23 01:08:04 CEST 2005


Is it because my question is too long that no one has answered it? I
should have split it. :(

On 6/22/05, Kjetil Brinchmann Halvorsen <kjetil at acelerate.com> wrote:
> Weiwei Shi wrote:
> 
> >Hi,
> >I have a text mining project and am currently working on the feature
> >generation/selection part.
> >My plan is to select a set of words or word combinations which have
> >better discriminant capability than other words in telling the group
> >ids (2 classes in this case) for a dataset which has 2,000,000
> >documents.
> >
> >One approach is using "contrast-set association rule mining" while the
> >other is using the chi-squared test or Fisher's exact test.
> >
> >Here is an example with 3 contingency tables for 3 words
> >(words coded by number):
> >
> >
> >> tab[,,1:3]
> >, , 1
> >
> >      [,1]    [,2]
> >[1,] 11266 2151526
> >[2,]   125   31734
> >
> >, , 2
> >
> >      [,1]    [,2]
> >[1,] 43571 2119221
> >[2,]    52   31807
> >
> >, , 3
> >
> >     [,1]    [,2]
> >[1,]  427 2162365
> >[2,]    5   31854
> >
> >
> >I have some questions on this:
> >1. What's the rule of thumb for using the chisq test instead of Fisher's
> >exact test? I have a vague memory that the count in each cell needs
> >to be over 50 if the chisq test is to be used instead of Fisher's exact
> >test. In the case of word 3, I think I should use the Fisher test.
> >However, running chisq as below works fine:
> >
> >
> >> tab[,,3]
> >     [,1]    [,2]
> >[1,]  427 2162365
> >[2,]    5   31854
> >
> >
> >> chisq.test(tab[,,3])
> >
> >        Pearson's Chi-squared test with Yates' continuity correction
> >
> >data:  tab[, , 3]
> >X-squared = 0.0963, df = 1, p-value = 0.7564
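
A note on why this table passes without a warning: as far as I can
tell, chisq.test() warns when any *expected* count (not observed count)
falls below 5. Below is a minimal sketch that uses only the word 3
counts shown above; the names tab3 and expected are mine, purely for
illustration.

```r
# Counts for word 3, copied from the table above (filled column by column).
tab3 <- matrix(c(427, 5, 2162365, 31854), nrow = 2)

# Expected counts under independence:
# (row total * column total) / grand total, for every cell.
expected <- outer(rowSums(tab3), colSums(tab3)) / sum(tab3)

# chisq.test() returns the same matrix in its $expected component.
res <- chisq.test(tab3)

print(expected)
print(min(expected))  # smallest expected count, to compare against 5
```

So the observed 5 in cell [2,1] does not by itself trigger the warning;
what matters is the smallest expected count, which for this table is a
little above 6.
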
> >
> >but running on the whole set of words (including 14240 words) has the
> >following warnings:
> >
> >
> >> p.chisq <- sapply(seq_len(N), function(i) chisq.test(tab[,,i])$p.value)
> >There were 50 or more warnings (use warnings() to see the first 50)
> >
> >
> >> warnings()
> >Warning messages:
> >1: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
> >2: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
> >3: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
> >4: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
> >
> >
> >2. So, my second question is: is this warning because I am violating
> >the assumptions of the chisq test? But why is Word 3 fine? How can I
> >trace the warning to see which words caused it?
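
One possible way to trace the warnings, sketched under some
assumptions: wrap each chisq.test() call in tryCatch() and record
whether a warning was raised. The first three tables are the ones
quoted above; the fourth is a made-up sparse table that I added only so
that the example has something to flag. The name warned is also mine.

```r
# Three tables from the post plus one HYPOTHETICAL sparse table (slice 4).
tab <- array(c(11266, 125, 2151526, 31734,
               43571,  52, 2119221, 31807,
                 427,   5, 2162365, 31854,
                   3,   1, 2162789, 31858),  # hypothetical, not real data
             dim = c(2, 2, 4))

# TRUE for every table whose chisq.test() call raises a warning.
warned <- sapply(seq_len(dim(tab)[3]), function(i) {
  tryCatch({
    chisq.test(tab[, , i])
    FALSE
  }, warning = function(w) TRUE)
})

# The same information without catching warnings:
# the smallest expected count per table.
min_expected <- apply(tab, 3, function(m) {
  min(outer(rowSums(m), colSums(m)) / sum(m))
})

print(which(warned))            # indices of the words that trigger a warning
print(which(min_expected < 5))  # indices with an expected count below 5
```

On the full data the same sapply() call over 1:N should give the
indices of the offending words, which can then be mapped back to the
word list.
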
> >
> >3. My result looks like this (after mapping the number ids back to
> >words; some words are stemmed here, e.g. ACCID is accident):
> > > of[1:50,]
> >      map...2.      p.fisher
> >21       ACCID  0.000000e+00
> >30          CD  0.000000e+00
> >67        ROCK  0.000000e+00
> >104      CRACK  0.000000e+00
> >111       CHIP  0.000000e+00
> >179      GLASS  0.000000e+00
> >84        BACK 4.199878e-291
> >395   DRIVEABL 5.335989e-287
> >60         CAP 9.405235e-285
> >262 WINDSHIELD 2.691641e-254
> >13          IV 3.905186e-245
> >110         HZ 2.819713e-210
> >11        CAMP 9.086768e-207
> >2      SHATTER 5.273994e-202
> >297        ALP 1.678521e-177
> >162        BED 1.822031e-173
> >249        BCD 1.398391e-160
> >493       RACK 4.178617e-156
> >59        CAUS 7.539031e-147
> >
> >3.1 Question: Should I use a two-sided test instead of a one-sided one
> >for the Fisher test? I read some material which suggests using two-sided.
> >
> >3.2 A big question: The result looks very promising, since this is a
> >case of classifying fraud cases and the words selected by this
> >approach make sense. However, I think the p-values here just indicate
> >the strength of evidence against the null hypothesis, not the strength
> >of association between a word and the class of a document. So, what
> >kind of statistic should I use here to evaluate the strength of
> >association? The odds ratio?
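
To make the odds ratio idea concrete, here is a rough sketch using the
three tables quoted earlier. It computes the sample odds ratio
(a*d)/(b*c) with a Wald-type 95% interval on the log scale, and for one
table also the conditional estimate that fisher.test() reports. The
helper name odds_ratio is mine. The post does not say which row or
column corresponds to which class, so the direction of the ratio should
be read with that in mind.

```r
# The three 2x2 tables from the post, filled column by column.
tab <- array(c(11266, 125, 2151526, 31734,
               43571,  52, 2119221, 31807,
                 427,   5, 2162365, 31854),
             dim = c(2, 2, 3))

# Sample odds ratio (a*d)/(b*c) of a 2x2 table.
odds_ratio <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])
or <- apply(tab, 3, odds_ratio)

# Wald standard error of log(OR): sqrt(1/a + 1/b + 1/c + 1/d).
se_log_or <- apply(tab, 3, function(m) sqrt(sum(1 / m)))
ci <- cbind(lower = exp(log(or) - 1.96 * se_log_or),
            upper = exp(log(or) + 1.96 * se_log_or))
print(cbind(odds.ratio = or, ci))

# fisher.test() also returns an odds ratio (conditional MLE) and interval.
ft <- fisher.test(tab[, , 3])
print(ft$estimate)
print(ft$conf.int)
```

Unlike the p-value, the odds ratio does not grow just because the
sample is large, so ranking words by it (or by its interval bound that
is closer to 1) may be closer to what is being asked for here.
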
> >
> >Any suggestions are welcome!
> >
> >Thanks!
> >
> >
> You can use chisq.test with sim=TRUE, or call it as usual first, see if
> there is a warning, and then call it again
> with sim=TRUE.
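
For reference, a small sketch of how I read this suggestion: sim=TRUE
is a partial match for the simulate.p.value argument of chisq.test().
The wrapper below calls chisq.test() normally and only falls back to
the simulated p-value when a warning is raised. The function name
chisq_p, the choice B = 2000, and the small sparse table are my own
assumptions for illustration.

```r
# Return the usual chisq.test() p-value; if the call raises a warning,
# redo the test with a Monte Carlo p-value instead.
chisq_p <- function(m, B = 2000) {
  tryCatch(
    chisq.test(m)$p.value,
    warning = function(w) {
      chisq.test(m, simulate.p.value = TRUE, B = B)$p.value
    }
  )
}

tab3   <- matrix(c(427, 5, 2162365, 31854), nrow = 2)  # word 3 from the post
sparse <- matrix(c(12, 1, 3, 4), nrow = 2)             # HYPOTHETICAL sparse table

set.seed(1)                   # the simulated p-value is random
p3       <- chisq_p(tab3)     # no warning, so the ordinary p-value
p_sparse <- chisq_p(sparse)   # warning, so the simulated p-value
print(c(word3 = p3, sparse = p_sparse))
```

Note that tryCatch() reacts to any warning, not only the "approximation
may be incorrect" one, and that the simulated p-value cannot be smaller
than 1/(B + 1), which matters when ranking very strong associations.
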
> 
> Kjetil
> 
> --
> 
> Kjetil Halvorsen.
> 
> Peace is the most effective weapon of mass construction.
>                --  Mahdi Elmandjra
> 
> 
> 
> 
> 


-- 
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III



