[R] (OT) Does Pearson correlation assume bivariate normality of the data?
Liviu Andronic
landronimirc at gmail.com
Tue May 26 21:10:58 CEST 2009
Dear all,
The other day I was reading this post [1] that slightly surprised me:
"To reject the null of no correlation, an hypothsis test based on the
normal distribution. If normality is not the base assumption your
working from then p-values, significance tests and conf. intervals
dont mean much (the value of the coefficient is not reliable) " (BOB
SAMOHYL).
To me this implied that, in practice, Pearson's product-moment
correlation (and its associated significance test) is often used
incorrectly. I then went wrestling with the literature, and with my
friends, over what the Pearson correlation actually requires, and
after about a week I'm still banging my head against divergent
opinions. From what I understand, there are two aspects to this
classical parametric procedure:
1. Estimating the magnitude of the correlation:
- the sample data should come from a bivariate normal distribution
(?cor, ?cor.test, Dalgaard 2003, somewhat implied in many examples
such as ?rrcov::maryo or Wilcox 2005)
- the sample data should be (I presume univariate) normal (Crawley
2007)
- the sample data can be of any distribution (if I understand
correctly the `distribution-free' definition of correlation in Huber
1981, 2004)
- the sample data could come from just about any bivariate
distribution (Wikipedia [2][3] and associated reference)
- the coefficient is far from robust to univariate outliers (e.g.,
Huber 1981) and to multivariate outliers (?rrcov::maryo with data
from Maronna and Yohai 1998)
2. Assessing whether the correlation is significantly different from
zero (using a statistic following the t distribution):
- the data should come from independent normal distributions (?cor.test)
- at least one of the marginal distributions is normal (Wilcox 2005)
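
To make the two aspects above concrete, here is a minimal sketch in R.
It uses simulated data, and the names x, y, n, x_out and y_out are
purely illustrative assumptions, not taken from any of the cited
sources. It computes the sample coefficient from its moment definition,
rebuilds the t statistic (n - 2 degrees of freedom) described in
?cor.test, and then shows how a single gross outlier moves the
estimate. It does not settle which distributional assumptions are
required; it only shows what is being computed.

```r
# Hypothetical simulated data; x, y and n are illustrative names only.
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# 1. Magnitude: the sample Pearson correlation is a ratio of sample moments.
r <- cor(x, y)
r_by_hand <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# 2. Significance: t statistic with n - 2 degrees of freedom (see ?cor.test).
t_stat <- r * sqrt((n - 2) / (1 - r^2))
p_val <- 2 * pt(-abs(t_stat), df = n - 2)

ct <- cor.test(x, y)  # method = "pearson" is the default
all.equal(unname(ct$statistic), t_stat)
all.equal(ct$p.value, p_val)

# Sensitivity to outliers: append one extreme point and recompute.
x_out <- c(x, 10)
y_out <- c(y, -10)
r_out <- cor(x_out, y_out)
c(clean = r, with_outlier = r_out)
```

The p-value is only as trustworthy as the reference t distribution,
which is exactly where the distributional assumptions listed above
come in.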
Surprisingly (to me), many sources seem quite evasive when it comes to
clearly defining the Pearson correlation. Reading the literature, I
was pretty much convinced that the correlation coefficient is not
robust to outliers. The literature is also convincing on the impact of
contaminated normal and other heavy-tailed distributions on parametric
tests (invalidating their results). However, I'm not clear on the
distributional assumptions on the data:
- does the data have to be bivariate normal in order to correctly
estimate the linear correlation?
- does the data have to be univariate normal in order to correctly
estimate the significance of the correlation?
If the above is true, what are the preferable alternatives for
non-Gaussian data (including heavy-tailed contaminated normal data)?
Non-parametric tests (Spearman, Kendall)? The robust estimators
MASS::cov.mcd(), rrcov::CovOgk() and robust::covRob()? Hypothesis
testing via permutation tests [4]? Is there a robust cor.test? Are
there other robust tests of independence?
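
For reference, this is roughly how some of the candidates above could
be tried side by side. It is only a sketch under stated assumptions:
the data are simulated from a t distribution with 3 degrees of freedom
to mimic heavy tails, the names (x, y, n_perm, r_perm, p_perm, r_mcd)
are illustrative, and the permutation test is a plain hand-rolled
version rather than the one described in [4]. MASS is assumed to be
installed (it normally ships with R). Nothing here indicates which
alternative is preferable.

```r
# Hypothetical heavy-tailed data (t distribution, 3 df); names are illustrative.
set.seed(42)
n <- 60
x <- rt(n, df = 3)
y <- 0.4 * x + rt(n, df = 3)

# Rank-based tests available through stats::cor.test().
sp <- cor.test(x, y, method = "spearman")
kd <- cor.test(x, y, method = "kendall")

# Permutation test using Pearson's r as the test statistic:
# shuffling y breaks any association but keeps both marginals intact.
r_obs <- cor(x, y)
n_perm <- 2000
r_perm <- replicate(n_perm, cor(x, sample(y)))
p_perm <- (sum(abs(r_perm) >= abs(r_obs)) + 1) / (n_perm + 1)

# Robust correlation estimate from the MCD estimator in MASS.
mcd <- MASS::cov.mcd(cbind(x, y), cor = TRUE)
r_mcd <- mcd$cor[1, 2]

c(pearson = r_obs,
  spearman = unname(sp$estimate),
  kendall = unname(kd$estimate),
  mcd = r_mcd,
  p_perm = p_perm)
```

Note that MASS::cov.mcd() returns an estimate only; it does not by
itself provide a significance test.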
Thank you,
Liviu
[1] http://www.nabble.com/Correlation-on-Tick-Data-tp18589474p18595197.html
[2] http://en.wikipedia.org/wiki/Correlation#Sensitivity_to_the_data_distribution
[3] http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Sensitivity_to_the_data_distribution
[4] http://www.burns-stat.com/pages/Tutor/bootstrap_resampling.html#permtest
--
Do you know how to read?
http://www.alienetworks.com/srtest.cfm
Do you know how to write?
http://garbl.home.comcast.net/~garbl/stylemanual/e.htm#e-mail