[R] (OT) Does pearson correlation assume bivariate normality of the data?
landronimirc at gmail.com
Tue May 26 21:10:58 CEST 2009
The other day I was reading this post that slightly surprised me:
"To reject the null of no correlation, an hypothsis test based on the
normal distribution. If normality is not the base assumption your
working from then p-values, significance tests and conf. intervals
dont mean much (the value of the coefficient is not reliable) " (BOB
To me this implied that in practice Pearson's product-moment
correlation (and its associated significance test) is often used
incorrectly. Then I went wrestling with the literature, and with my
friends, over what the Pearson correlation actually assumes, and after
about a week I'm still banging my head against divergent opinions. From what I
understand there are two aspects to this classical parametric
1. Estimating the magnitude of the correlation:
- the sample data should come from a bivariate normal distribution
(?cor, ?cor.test, Dalgaard 2003, somewhat implied in many examples
such as ?rrcov::maryo or Wilcox 2005)
- the sample data should be (I presume univariate) normal (Crawley
- the sample data can be of any distribution (if I understand
correctly the `distribution-free' definition of correlation in Huber
- the sample data could come from just about any bivariate
distribution (Wikipedia  and associated reference)
- the coefficient is not at all robust to univariate outliers (e.g.,
Huber 1981) or to multivariate outliers (?rrcov::maryo with data
from Maronna and Yohai 1998)
2. Assessing whether the correlation is significantly different from
zero (using a statistic following the t distribution):
- the data should come from independent normal distributions (?cor.test)
- at least one of the marginal distributions is normal (Wilcox 2005)
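
To make the two aspects concrete, here is a minimal sketch in base R on
made-up data (the variables x, y, x_out and y_out are purely illustrative
and do not come from any of the sources cited above). It first compares how
Pearson's r and Spearman's rho react to one gross outlier, then recomputes
by hand the t statistic that cor.test() reports for method = "pearson".

```r
# Hypothetical toy data; names and parameters are assumptions for illustration.
set.seed(1)
n <- 30
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# 1. Magnitude: append a single gross outlier and compare the estimates.
x_out <- c(x, 10)
y_out <- c(y, -10)
r_clean     <- cor(x, y)                               # Pearson, clean data
r_outlier   <- cor(x_out, y_out)                       # Pearson, with outlier
rho_clean   <- cor(x, y, method = "spearman")          # Spearman, clean data
rho_outlier <- cor(x_out, y_out, method = "spearman")  # Spearman, with outlier

# 2. Significance: t = r * sqrt(n - 2) / sqrt(1 - r^2), referred to a
#    t distribution with n - 2 degrees of freedom (two-sided p-value).
t_by_hand <- r_clean * sqrt(n - 2) / sqrt(1 - r_clean^2)
p_by_hand <- 2 * pt(-abs(t_by_hand), df = n - 2)
ct <- cor.test(x, y, method = "pearson")

print(c(r_clean = r_clean, r_outlier = r_outlier))
print(c(rho_clean = rho_clean, rho_outlier = rho_outlier))
print(c(t_by_hand = t_by_hand, t_cor_test = unname(ct$statistic)))
print(c(p_by_hand = p_by_hand, p_cor_test = ct$p.value))
```

Note that the sketch only shows how the statistic is computed; it says
nothing about whether the t reference distribution is appropriate for
non-normal data, which is exactly the open question.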
Surprisingly (to me), many sources seem quite evasive when it comes to
clearly defining the Pearson correlation. Reading the literature, I was
pretty much convinced that the correlation coefficient is not robust to
outliers. The literature is also convincing on the impact of
contaminated-normal and heavy-tailed distributions on parametric tests
(invalidating their results). However, I'm not clear on the
distributional assumptions on the data:
- does the data have to be bivariate normal in order to correctly
estimate the linear correlation?
- does the data have to be univariate normal in order to correctly
estimate the significance of the correlation?
If the above is true, what are the preferable alternatives for
non-Gaussian data (including heavy-tailed normal)? Non-parametric
tests (Spearman, Kendall)? The robust estimators MASS::cov.mcd(),
rrcov::CovOgk(), robust::covRob()? Hypothesis testing via permutation
tests? Is there a robust cor.test? Other robust tests of independence?
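
For reference, this is roughly how I would try some of these candidates,
as a minimal sketch on simulated heavy-tailed data. The data and the helper
perm_cor_test() are hypothetical, written only for illustration; the
permutation test is a plain Monte Carlo version and not taken from any
package. The MASS::cov.mcd() step is guarded in case MASS is not installed.

```r
# Hypothetical heavy-tailed data (t distribution with 3 df); illustration only.
set.seed(42)
n <- 40
x <- rt(n, df = 3)
y <- 0.4 * x + rt(n, df = 3)

# Rank-based tests from the stats package.
sp <- cor.test(x, y, method = "spearman", exact = FALSE)
kd <- cor.test(x, y, method = "kendall")

# Monte Carlo permutation test of independence with Pearson's r as the
# statistic: shuffle y, recompute r, and count how often |r| is at least
# as large as the observed value. (Hypothetical helper, not a package function.)
perm_cor_test <- function(x, y, n_perm = 2000) {
  r_obs  <- cor(x, y)
  r_perm <- replicate(n_perm, cor(x, sample(y)))
  p <- (1 + sum(abs(r_perm) >= abs(r_obs))) / (n_perm + 1)
  list(estimate = r_obs, p.value = p)
}
perm_res <- perm_cor_test(x, y)

print(c(spearman = unname(sp$estimate), kendall = unname(kd$estimate),
        pearson = perm_res$estimate))
print(c(p_spearman = sp$p.value, p_kendall = kd$p.value,
        p_permutation = perm_res$p.value))

# Robust correlation estimate via the minimum covariance determinant,
# only if the MASS package is available.
if (requireNamespace("MASS", quietly = TRUE)) {
  rob <- MASS::cov.mcd(cbind(x, y), cor = TRUE)
  print(rob$cor[1, 2])
}
```

The permutation p-value tests independence of x and y rather than zero
linear correlation specifically, and the MCD estimate comes without a
p-value, so neither is a drop-in replacement for cor.test().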