[R] comparing strength of association instead of strength of evidence?

Kjetil Brinchmann Halvorsen kjetil at acelerate.com
Sun Jul 10 14:51:54 CEST 2005

```Weiwei Shi wrote:

>Dear all:
>I still need some further help since I think the question itself might
>be very interesting (i hope so:) :
>the question is on chisq.test, my concern is which criteria should be
>used here to evaluate the independence. The reason i use this old
>subject of the email is, b/c I think the problem here is about how to
>look at p.value, which evaluate the strength of evidence instead of
>association. If i am wrong, please correct me.
>
>the result looks like this:
>   index   word.comb     id in.class0 in.class1      p.value odds.ratio
>1      1  TOTAL|LAID 54|241         2         4 0.0004997501 0.00736433
>2      2 THEFT|RECOV  52|53     40751       146 0.0004997501 4.17127643
>3      3  BOLL|ACCID  10|21     36825      1202 0.0004997501 0.44178546
>4      4  LAB|VANDAL   8|55     24192       429 0.0004997501 0.82876099
>5      5 VANDAL|CAUS  55|59       801        64 0.0004997501 0.18405918
>6      6    AI|TOTAL   9|54      1949        45 0.0034982509 0.63766766
>7      7    AI|RECOV   9|53      2385        61 0.0004997501 0.57547012
>8      8 THEFT|TOTAL  52|54     33651       110 0.0004997501 4.56174408
>
>the target is to look for best subset of word.comb to differentiate
>between class0 and class1. p.value is obtained via
>p.chisq.sim[i] <- as.double(chisq.test(tab, sim=TRUE, B=myB)\$p.value)
>and B=20000 (I increased B and it won't help. the margin here is
>class0=2162792
>class1=31859
>)
>
>So, in conclusion, which one I should use first, odds.ratio or p.value
>to find the best subset.
>
>
>
If your goal is to discriminate between two different classes, why not
calculate directly
a measure of discriminative ability, such as probability of correct
classification?

Kjetil

>I read some on feature selection in text categorization (A comparative
>study on feature selection in text categorization might be a good
>ref.). Anyone here has other suggestions?
>
>thanks,
>
>weiwei
>
>
>On 6/24/05, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
>
>
>>On 6/24/05, Kjetil Brinchmann Halvorsen <kjetil at acelerate.com> wrote:
>>
>>
>>>Weiwei Shi wrote:
>>>
>>>
>>>
>>>>Hi,
>>>>I asked this question before, which was hidden in a bunch of
>>>>questions. I repharse it here and hope I can get some help this time:
>>>>
>>>>I have 2 contingency tables which have the same group variable Y. I
>>>>want to compare the strength of association between X1/Y and X2/Y. I
>>>>am not sure if comparing p-values IS the way  even though the
>>>>probability of seeing such "weird" observation under H0 defines
>>>>p-value and it might relate to the strength of association somehow.
>>>>But I read the following statement from Alan Agresti's "An
>>>>Introduction to Categorical Data Analysis" :
>>>>"Chi-squared tests simply indicate the degree of EVIDENCE for an
>>>>association....It is sensible to decompose chi-squared into
>>>>components, study residuals, and estimate parameters such as odds
>>>>ratios that describe the STRENGTH OF ASSOCIATION".
>>>>
>>>>
>>>>
>>>>
>>>>
>>>Here are some things you can do:
>>>
>>> > tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
>>>
>>> > tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
>>> > library(epitools) # on CRAN
>>> > ?odds.ratio
>>>Help for 'odds.ratio' is shown in the browser
>>> > library(help=epitools) # on CRAN
>>> > tab1
>>>     [,1]    [,2]
>>>[1,] 11266 2151526
>>>[2,]   125   31734
>>> > odds.ratio(11266, 125, 2151526, 31734)
>>>Error in fisher.test(tab) : FEXACT error 40.
>>>Out of workspace.                 # so this are evidently for tables
>>>with smaller counts
>>> > library(vcd) # on CRAN
>>>
>>> > ?oddsratio
>>>Help for 'oddsratio' is shown in the browser
>>> > oddsratio( tab1)  # really is logodds ratio
>>>[1] 0.2807548
>>> > plot(oddsratio( tab1) )
>>> > library(help=vcd) # on CRAN  Read this for many nice functions.
>>> > fourfoldplot(tab1)
>>> > mosaicplot(tab1)     # not really usefull for this table
>>>
>>>Also has a look at function Crosstable in package gmodels.
>>>
>>>To decompose the chisqure you can program yourselves:
>>>
>>>decomp.chi <- function(tab) {
>>>       rows <-  rowSums(tab)
>>>       cols <-   colSums(tab)
>>>       N <-   sum(rows)
>>>        E <- rows %o% cols / N
>>>        contrib <- (tab-E)^2/E
>>>        contrib }
>>>
>>>
>>> > decomp.chi(tab1)
>>>         [,1]         [,2]
>>>[1,] 0.1451026 0.0007570624
>>>[2,] 9.8504915 0.0513942218
>>> >
>>>
>>>So you can easily see what cell contributes most to the overall chisquared.
>>>
>>>Kjetil
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>>Can I do this "decomposition" in R for the following example including
>>>>2 contingency tables?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
>>>>>tab1
>>>>>
>>>>>
>>>>>
>>>>>
>>>>     [,1]    [,2]
>>>>[1,] 11266 2151526
>>>>[2,]   125   31734
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
>>>>>tab2
>>>>>
>>>>>
>>>>>
>>>>>
>>>>     [,1]    [,2]
>>>>[1,] 43571 2119221
>>>>[2,]    52   31807
>>>>
>>>>
>>>>
>>Here are a few more ways of doing this using chisq.test,
>>glm and assocplot:
>>
>>
>>
>>>## chisq.test ###
>>>
>>>
>>>tab1.chisq <- chisq.test(tab1)
>>>
>>>
>>># decomposition of chisq
>>>resid(tab1.chisq)^2
>>>
>>>
>>          [,1]         [,2]
>>[1,] 0.1451026 0.0007570624
>>[2,] 9.8504915 0.0513942218
>>
>>
>>
>>># same
>>>with(tab1.chisq, (observed - expected)^2/expected)
>>>
>>>
>>          [,1]         [,2]
>>[1,] 0.1451026 0.0007570624
>>[2,] 9.8504915 0.0513942218
>>
>>
>>
>>
>>># Pearson residuals
>>>resid(tab1.chisq)
>>>
>>>
>>           [,1]        [,2]
>>[1,]  0.3809234 -0.02751477
>>[2,] -3.1385493  0.22670294
>>
>>
>>
>>># same
>>>with(tab1.chisq, (observed - expected)/sqrt(expected))
>>>
>>>
>>           [,1]        [,2]
>>[1,]  0.3809234 -0.02751477
>>[2,] -3.1385493  0.22670294
>>
>>
>>
>>
>>>### glm ###
>>># Pearson residuals via glm
>>>
>>>
>>>tab1.df <- data.frame(count = c(tab1), A = gl(2,2), B = gl(2,1,4))
>>>tab1.glm <- glm(count ~ ., tab1.df, family = poisson())
>>>resid(tab1.glm, type = "pearson")
>>>
>>>
>>          1           2           3           4
>> 0.38092339 -3.13854927 -0.02751477  0.22670294
>>
>>
>>>plot(tab1.glm)
>>>
>>>
>>>### assocplot ###
>>># displaying Pearson residuals via an assocplot
>>>assocplot(t(tab1))
>>>
>>>
>
>
>
>

--

Kjetil Halvorsen.

Peace is the most effective weapon of mass construction.
--  Mahdi Elmandjra

--
No virus found in this outgoing message.
Checked by AVG Anti-Virus.

```