[R] correlation between categorical data

Daniel Malter daniel at umd.edu
Sun Jun 21 13:13:32 CEST 2009


For measures of association between two variables with two values each,
Cramer's V and Yule's Q are useful statistics. See this thread, for
example: http://markmail.org/message/sjd53z2dv2pb5nd6

Plotting can sometimes give a feel for the association. With two binary
variables the raw points overplot, so use the jitter function in the plot:

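n=1000 #sample size; not given in the original post, assumed here
x=rnorm(n,0,1) #latent predictor; also not given, assumed standard normal
e=rnorm(n,0,1)
y=x+e
xprob=exp(x)/(1+exp(x)) #inverse logit of x
yprob=exp(y)/(1+exp(y)) #inverse logit of y
xcat=rbinom(n,1,xprob) #binary version of x
ycat=rbinom(n,1,yprob) #binary version of y
plot(ycat~xcat) #totally useless
plot(jitter(ycat)~jitter(xcat)) #can be somewhat useful
table(ycat,xcat) # interesting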

#A measure of correlation between nominal variables
yule.Q=function(x,y){
  tab=table(x,y) #2x2 contingency table, computed once
  (tab[1,1]*tab[2,2]-tab[1,2]*tab[2,1])/(tab[1,1]*tab[2,2]+tab[1,2]*tab[2,1])
}
yule.Q(ycat,xcat)
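
Cramer's V is mentioned above without code, so here is a minimal sketch
using the usual definition V = sqrt(chi2 / (n * (min(rows, cols) - 1))),
with chi2 taken from chisq.test() without the continuity correction. The
function name cramers.v and the small vectors u, v.same and v.indep are
made up for illustration; they are not from the original post.

```r
# Cramer's V for two categorical vectors (sketch; the name is hypothetical)
cramers.v <- function(x, y) {
  tab <- table(x, y)
  # Pearson chi-squared statistic without Yates' correction;
  # warnings about small expected counts are suppressed for this demo
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  unname(sqrt(chi2 / (n * (k - 1))))
}

# Illustrative data only
u       <- c(0, 0, 0, 1, 1, 1, 0, 1)
v.same  <- u                          # identical to u: perfect association
v.indep <- c(0, 1, 0, 1, 0, 1, 1, 0)  # every cell of table(u, v.indep) is 2

cramers.v(u, v.same)   # perfect association, so V should be 1
cramers.v(u, v.indep)  # no association in the table, so V should be 0
```

Unlike Yule's Q, V is not restricted to 2x2 tables, but it carries no sign.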

Best,
Daniel




-------------------------
cuncta stricte discussurus
-------------------------

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of Marc Schwartz
Sent: Saturday, June 20, 2009 7:37 PM
To: Jason Morgan
Cc: r-help
Subject: Re: [R] correlation between categorical data


On Jun 20, 2009, at 2:05 PM, Jason Morgan wrote:

> On 2009.06.19 14:04:59, Michael wrote:
>> Hi all,
>>
>> In a data-frame, I have two columns of data that are categorical.
>>
>> How do I form some sort of measure of correlation between these two 
>> columns?
>>
>> For numerical data, I just need to regress one to the other, or do 
>> some pairs plot.
>>
>> But for categorical data, how do I find and/or visualize correlation 
>> between the two columns of data?
>
> As Dylan mentioned, using crosstabs may be the easiest way. Also, a 
> simple correlation between the two variables may be informative. If 
> each variable is ordinal, you can use Kendall's tau-b (square table) 
> or tau-c (rectangular table). The former you can calculate with ?cor 
> (set method="kendall"), the latter you may have to hack something 
> together yourself, there is code on the Internet to do this. If the 
> data are nominal, then a simple chi-squared test (large-n) or Fisher's 
> exact test (small-n) may be more appropriate. There are rules about 
> which to use when one variable is ordinal and one is nominal, but I 
> don't have my notes in front of me. Maybe someone else can provide 
> more assistance (and correct me if I'm wrong :).
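
The suggestions quoted above can be sketched in a few lines of base R. This
is only an illustration: the vectors a and b are made-up ordinal ratings
coded 1 < 2 < 3, not data from this thread. The ordinal case uses cor()
with method="kendall" as the quoted message describes; the nominal case
runs chisq.test() and fisher.test() on the cross-tabulation.

```r
# Hypothetical ordinal ratings coded 1 < 2 < 3 (illustrative data only)
a <- c(1, 1, 2, 2, 3, 3, 1, 2, 3, 3)
b <- c(1, 2, 2, 3, 3, 3, 1, 1, 2, 3)

# Ordinal case: Kendall's tau via cor()
tau.b <- cor(a, b, method = "kendall")

# Nominal case: cross-tabulate, then test for association
tab <- table(a, b)
chi <- suppressWarnings(chisq.test(tab))  # large-n approximation
fis <- fisher.test(tab)                   # exact test

tau.b
chi$p.value
fis$p.value
```

With only ten observations chisq.test() would normally warn that the
approximation may be poor, which is why the warning is suppressed here.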



I would be cautious in recommending the Fisher Exact Test (FET) on the basis
of small sample sizes, as the FET has been shown to be overly conservative.
This also applies to the use of the continuity correction for the chi-square
test (which replicates the behavior of the FET).
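
To see the effect on a single table, here is a small sketch comparing the
three p-values. The 2x2 counts and the dimnames are invented for
illustration. One table cannot demonstrate conservativeness in general (the
papers below do that properly), but it shows the direction of the
difference.

```r
# Made-up small 2x2 table (illustrative counts only)
tab <- matrix(c(7, 3, 2, 8), nrow = 2,
              dimnames = list(group = c("a", "b"),
                              outcome = c("yes", "no")))

# Pearson chi-squared test without and with the continuity correction;
# warnings about small expected counts are suppressed for this demo
p.plain  <- suppressWarnings(chisq.test(tab, correct = FALSE))$p.value
p.yates  <- suppressWarnings(chisq.test(tab, correct = TRUE))$p.value

# Fisher's exact test (two-sided by default)
p.fisher <- fisher.test(tab)$p.value

round(c(uncorrected = p.plain, yates = p.yates, fisher = p.fisher), 4)
```

For this table the uncorrected p-value is the smallest of the three, and
the corrected chi-square and Fisher p-values land close to each other.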

For more information see:

Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample
recommendations
Ian Campbell
Stat Med. 2007;26:3661-3675.
http://www3.interscience.wiley.com/journal/114125487/abstract

and:

How conservative is Fisher's exact test?
A quantitative evaluation of the two-sample comparative binomial trial
Gerald G. Crans, Jonathan J. Shuster
Stat Med. 2008 Aug 15;27(18):3598-611.
http://www3.interscience.wiley.com/journal/117929459/abstract


Frank also has some comments here (bottom of the page):

http://biostat.mc.vanderbilt.edu/wiki/Main/DataAnalysisDisc#Some_Important_Points_about_Cont


More generally, Agresti's Categorical Data Analysis is typically the first
reference to reach for in this domain. There is also a document written by
Laura Thompson that serves as a nice R companion to Agresti. It is
available from:

https://home.comcast.net/~lthompson221/Splusdiscrete2.pdf


HTH,

Marc Schwartz




