[R] Correlation when one variable has zero variance (polychoric?)

John Fox jfox at mcmaster.ca
Wed Dec 19 15:29:36 CET 2007


Dear Jose,

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> project.org] On Behalf Of Jose Quesada
> Sent: December-19-07 7:54 AM
> To: r-help at lists.r-project.org
> Subject: [R] Correlation when one variable has zero variance
> (polychoric?)
> 
> Hi,
> 
> I'm running this for a simulation study, so many combinations of
> parameters produce many predictions that I need to correlate with data.
> 
> The problem
> ----------------
> I'm using rating data with 3 to 5 categories (e.g., too low, correct, too
> high). The underlying continuous scales should be normal, so I chose the
> polychoric correlation. I'm using library(polycor) in its latest version,
> 0.7.4.
> 
> The problem is that sometimes the models always predict the same value
> (i.e., the median). Example frequency table:
> 
> > table(med$ADRI_LAN, rate$ADRI_LAN)
>        2   3   4   5
>    3  28 179 141  50
> 
> That is, there is no variability in one of the variables (the only value
> is 3, the median).
> 
> The Pearson product-moment correlation is the covariance divided by the
> product of the standard deviations of the two variables. If the standard
> deviation of one of the variables is zero, then the denominator is zero
> and the correlation cannot be computed. R returns NA and a warning.
> 
> If I add jitter to the variable with no variability, then I get a
> virtually zero, but calculable, Pearson correlation.
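
For reference, the Pearson behaviour described here can be reproduced in base R. This is only a sketch: the ratings below are made up for illustration and are not the data from the table above.

```r
# Made-up data: a constant "prediction" and some varying ratings.
x <- rep(3, 8)                     # zero variance
y <- c(2, 3, 3, 4, 4, 5, 3, 4)

# sd(x) is 0, so cor() returns NA (and warns that the standard deviation
# is zero); the warning is suppressed here to keep the output quiet.
r_const <- suppressWarnings(cor(x, y))

# Jittering x makes the correlation computable, but the value only
# reflects the random noise that was added.
set.seed(1)
r_jit <- cor(jitter(x), y)

is.na(r_const)
is.finite(r_jit)
```

The jittered result is computable but carries no information about the relationship, since all of the variation in x was introduced artificially.
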
> 
> However, when I use the polychoric correlation (using the default
> settings), I get just the opposite: a very high correlation!
> 
> > polychoric = polychor(med$ADRI_LAN, rate$ADRI_LAN) #, ML=T, std.err=T
> > polychoric
> [1] 0.999959
> 
> This is very counterintuitive. 

This is simply a bug in polychor(), which currently does the following test:

  if (r < 1) stop("the table has fewer than 2 rows")
  if (c < 2) stop("the table has fewer than 2 columns")

That is, my intention was to check (r < 2) and report an error. Actually, it
would probably be better to return NA and report a warning.
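
A minimal sketch of what the corrected check might look like, written as a wrapper rather than as the code inside polychor() itself. The names guarded_polychor and estimator are hypothetical (estimator stands in for whatever function computes the correlation from a table); this is not the actual polycor source.

```r
# Hypothetical wrapper, not part of polycor: refuse tables that cannot
# carry any information about the correlation.
guarded_polychor <- function(tab, estimator) {
  nr <- nrow(tab)
  nc <- ncol(tab)
  if (nr < 2 || nc < 2) {
    warning("the table has fewer than 2 rows or 2 columns; returning NA")
    return(NA_real_)
  }
  estimator(tab)
}

# The one-row table from the example above.
one_row <- matrix(c(28, 179, 141, 50), nrow = 1,
                  dimnames = list(med = "3", rate = c("2", "3", "4", "5")))

# The estimator is never reached for this table, so a placeholder is enough.
res <- suppressWarnings(
  guarded_polychor(one_row, function(tab) stop("not reached"))
)
is.na(res)
```
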

> I also ran the same analysis in 2005 (what has changed in the package
> polycor since then, I don't know) and the results were different. I think
> back then I contrasted them with SAS and they were the same.

I don't entirely follow this. Are you referring to the table above with one
row, more generally to tables with zero marginals, or to tables in which
there are interior zeroes?

> Maybe the approximation fails in extreme cases where most of the cells
> are zero? Maybe the approximation was not used in the first releases of
> the package? But it seems that the ML estimator doesn't work at all (at
> least in the current version of the package) with those tables, where
> most cells are zero because one variable has no variability:
> 
> > polychor(med$ADRI_LAN, rate$ADRI_LAN, ML=T)
> Error in tab * log(P) : non-conformable arrays

When there are zero marginals the ML estimate cannot be unique since there
is zero information about one or more of the thresholds.
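
To illustrate, here is a small sketch of how thresholds are usually obtained from the marginal counts in the two-step approach (normal quantiles of the cumulative proportions). The helper name thresholds is hypothetical, and the second set of counts is made up to show a variable whose only observed category is the middle one.

```r
# Hypothetical helper: thresholds from cumulative marginal proportions.
# The last cumulative proportion is always 1 and is dropped.
thresholds <- function(counts) {
  p <- cumsum(counts) / sum(counts)
  qnorm(p[-length(p)])
}

# Column marginals from the table above: three finite, increasing thresholds.
t_rate <- thresholds(c(28, 179, 141, 50))

# A three-category variable observed only in its middle category:
# the cumulative proportions are 0 and 1, so the thresholds are -Inf and Inf.
t_const <- thresholds(c(0, 398, 0))

t_rate
t_const
```

With infinite (or coinciding) thresholds, the data say nothing about where the category boundaries lie on the latent scale.
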

> 
> I've seen some posts where sparse tables were trouble, e.g.:
> http://www.nabble.com/polychor-error-td5954345.html#a5954345
>   "You're expecting a lot out of ML to get estimates of the first couple
>   of thresholds for rows and the first for columns. [which were mostly
>   zeroes]"
> 
> Are the polychoric estimates using the approximation completely wrong?

Yes. If there is a zero marginal, then the estimate shouldn't have been
computed in the first place (and was due to the error that I mentioned).

> Is
> there any way to compute a polychoric correlation with such a dataset?

I'd say no. There is no information in the data about the correlation.

> What should I conclude from data like these?

That the data aren't informative about the parameters of interest.

> Maybe using correlation is not the right thing to do.

Presumably the normally distributed latent variables that underlie the table
have some correlation, but you can't estimate it from the data.

I'll fix polychor() (and put in a test for 0 marginals as well as single-row
or -column tables) -- thanks for the bug report.
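
As a rough sketch of what a test for 0 marginals could look like: zero marginals typically arise when the variables are factors with unused levels. The helper drop_empty_margins is hypothetical and the data are made up; this is not the code that will go into the package.

```r
# Hypothetical helper: remove rows and columns whose marginal count is 0.
drop_empty_margins <- function(tab) {
  tab[rowSums(tab) > 0, colSums(tab) > 0, drop = FALSE]
}

# Made-up data: the predictions are always 3, but the factor keeps
# all four levels, so the table has three all-zero rows.
med  <- factor(rep(3, 6), levels = 2:5)
rate <- factor(c(2, 3, 3, 4, 4, 5), levels = 2:5)
tab  <- table(med, rate)

reduced <- drop_empty_margins(tab)

# After dropping the empty rows, only one row is left, so the table
# cannot support an estimate of the correlation.
estimable <- nrow(reduced) >= 2 && ncol(reduced) >= 2

dim(tab)
dim(reduced)
estimable
```
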

Regards,
John

> 
> Thanks,
> -Jose
> 
> --
> Jose Quesada, PhD.
> http://www.andrew.cmu.edu/~jquesada
> 