[R] Correlation when one variable has zero variance (polychoric?)

Wed Dec 19 13:54:08 CET 2007

Hi,

I'm running a simulation study in which many parameter combinations  
produce many predictions that I need to correlate with data.

The problem
----------------
I'm using rating data with 3 to 5 categories (e.g., too low, correct, too  
high). The underlying continuous scales should be normal, so I chose the  
polychoric correlation. I'm using library(polycor) in its latest version,  
0.7.4.

The problem is that sometimes the models always predict the same value  
(i.e., the median). Example frequency table:

> table(med$ADRI_LAN, rate$ADRI_LAN)
       2   3   4   5
   3  28 179 141  50

That is, there is no variability in one of the variables (the only value  
is 3, the median).

The Pearson product-moment correlation is the covariance divided by the  
product of the standard deviations of the two variables. If the standard  
deviation of one of the variables is zero, then the denominator is zero  
and the correlation cannot be computed. R returns NA and a warning.
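
To make the zero-variance case concrete, here is a minimal sketch in base R. The vectors med and rate are rebuilt from the frequency table above as stand-ins for med$ADRI_LAN and rate$ADRI_LAN, so the ordering of cases is an assumption (it does not affect the result).

```r
# Rebuild the two variables from the frequency table (stand-ins for the real columns)
med  <- rep(3, 28 + 179 + 141 + 50)            # model prediction: always 3
rate <- rep(2:5, times = c(28, 179, 141, 50))  # observed ratings

sd(med)   # 0: no variability in the predictions

# cor() warns that the standard deviation is zero; the warning is
# suppressed here so that only the NA result is kept
r <- suppressWarnings(cor(med, rate))
is.na(r)
```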

If I add jitter to the variable with no variability, then I get a Pearson  
correlation that is virtually zero, but computable.
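
A minimal sketch of the jitter check, again with vectors rebuilt from the frequency table rather than the real columns. The seed is arbitrary and only there for reproducibility; the exact value of the correlation will vary with it.

```r
set.seed(1)  # arbitrary seed, assumption for reproducibility only

med  <- rep(3, 398)                            # constant predictions
rate <- rep(2:5, times = c(28, 179, 141, 50))  # observed ratings

# jitter() with default settings adds a small amount of uniform noise
med_jit <- jitter(med)

# The noise is unrelated to rate, so the correlation is computable
# but should be small in absolute value
r_jit <- cor(med_jit, rate)
r_jit
```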

However, when I use the polychoric correlation (using the default  
settings), I get just the opposite: a very high correlation!

> polychoric    = polychor( med$ADRI_LAN, rate$ADRI_LAN ) #, ML=T, std.err=T
> polychoric
[1] 0.999959

This is very counterintuitive. I also ran the same analysis in 2005 (I  
don't know what has changed in the polycor package since then) and the  
results were different. I think back then I compared them with SAS and  
they were the same. Maybe the approximation fails in extreme cases where  
most of the cells are zero? Maybe the approximation was not used in the  
first releases of the package? But it seems that the ML estimator doesn't  
work at all (at least in the current version of the package) with tables  
where most cells are zero because one variable has no variability:

> polychor(med$ADRI_LAN, rate$ADRI_LAN, ML=T)
Error in tab * log(P) : non-conformable arrays

I've seen some posts where sparse tables caused trouble, e.g.:  
http://www.nabble.com/polychor-error-td5954345.html#a5954345
  "You're expecting a lot out of ML to get estimates of the first couple of  
thresholds for rows and the first for columns. [which were mostly zeroes]"

Are the polychoric estimates using the approximation completely wrong? Is  
there any way to compute a polychoric correlation with such a dataset?  
What should I conclude from data like these?
Maybe using correlation is not the right thing to do.
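
In the meantime, for the simulation loop, a minimal sketch of a guard that flags the degenerate cases instead of returning an estimate or an error. Both helper names (has_variability, guarded_cor) are hypothetical and not part of polycor; the default cor_fun is base cor, and the assumption is that polychor could be passed in its place once polycor is loaded.

```r
# Hypothetical helper: TRUE if x takes at least two distinct non-missing values
has_variability <- function(x) length(unique(x[!is.na(x)])) >= 2

# Hypothetical wrapper: returns NA for degenerate input,
# otherwise calls the supplied correlation function
guarded_cor <- function(x, y, cor_fun = cor) {
  if (!has_variability(x) || !has_variability(y)) return(NA_real_)
  cor_fun(x, y)
}

med  <- rep(3, 398)                            # constant predictions
rate <- rep(2:5, times = c(28, 179, 141, 50))  # observed ratings

res_degenerate <- guarded_cor(med, rate)       # NA, because med is constant
res_regular    <- guarded_cor(rate, rev(rate)) # both vary, so cor() is called
```

This only decides when not to compute a correlation; it does not say what the right summary is for a model that always predicts the median.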

Thanks,
-Jose

-- 
Jose Quesada, PhD.
http://www.andrew.cmu.edu/~jquesada