[R] bug? in stats::cor for use=complete.obs with NAs
ehlers at ucalgary.ca
Fri Jun 11 01:02:36 CEST 2010
I don't think that this would be considered a bug. The
reason for the discrepancy between use="complete.obs"
and use="pairwise.complete.obs" for the case of the
Spearman correlation of two vectors x, y is this:
"pairwise" does complete.cases(x,y) and then ranks;
this is also what's done in cor.test().
"complete" ranks first (keeping NAs via the
na.last="keep" argument to rank()) and then does
complete.cases(ranked.x,ranked.y) on the ranked data.
This can obviously lead to a different set of
ranks being correlated than those for "pairwise".
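
To make the two orders of operation concrete, here is a small sketch that reproduces both numbers by hand for the a and b from the original post quoted below. It only mimics the sequence of steps described above using rank(), complete.cases() and a plain Pearson cor(); it is not the actual code inside cor(), and the variable names are mine.

```r
a <- c(1, 3, NA, 1, 2)
b <- c(1, 2, 1, 1, 4)

# "pairwise" order: drop incomplete cases first, then rank what is left
ok <- complete.cases(a, b)
r_pairwise <- cor(rank(a[ok]), rank(b[ok]))

# "complete" order as described above: rank first (NAs kept as NA),
# then drop the cases whose ranks are incomplete
ra <- rank(a, na.last = "keep")
rb <- rank(b, na.last = "keep")
ok_ranked <- complete.cases(ra, rb)
r_complete <- cor(ra[ok_ranked], rb[ok_ranked])

print(r_pairwise)  # 0.7777778
print(r_complete)  # 0.8164966
```

Both values match what the original poster reported, so the difference comes entirely from ranking b over all five values instead of over the four complete cases.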
I must admit that I wasn't aware that R does this
and I don't know the rationale for it. The help page says:
   If use is "complete.obs" then missing values are
   handled by casewise deletion ...
which is not clear on the order of ranking and
deletion, but further down the page:
Note that "spearman" basically computes cor(R(x), R(y))
(or cov(.,.)) where R(u) := rank(u, na.last="keep").
In the case of missing values, the ranks are calculated
depending on the value of use, either based on complete
observations, or based on pairwise completeness with
reranking for each pair.
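
As a quick illustration of what R(u) looks like, again assuming the a and b from the original post quoted below (the variable names are mine, and this is only a sketch of the ranking step, not of cor() itself):

```r
a <- c(1, 3, NA, 1, 2)
b <- c(1, 2, 1, 1, 4)

# R(u) := rank(u, na.last = "keep"): the NA stays NA, ties get average ranks
ra_keep <- rank(a, na.last = "keep")   # 1.5 4.0  NA 1.5 3.0
rb_keep <- rank(b, na.last = "keep")   # 2 4 2 2 5

# ranks of b after the incomplete third case has been removed
ok <- complete.cases(a, b)
rb_after <- rank(b[ok])                # 1.5 3.0 1.5 4.0

print(ra_keep)
print(rb_keep)
print(rb_after)
```

The tied 1s in b get rank 2 when all five values are ranked, but rank 1.5 once the third case has been dropped, so the order of ranking and deletion matters here.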
I guess that this implies that, for "complete", the ranking
occurs before the casewise deletion (else why the
If anyone knows the rationale and/or can give a reference,
I'd be glad to receive such.
On 2010-06-09 11:36, hugh.genin at thomsonreuters.com wrote:
> I think I've found a bug in the behavior of the stats::cor function when
> NAs are present, but in case I'm missing something, could you look over
> this example and let me know what you think:
> > a = c(1,3,NA,1,2)
> > b = c(1,2,1,1,4)
> > cor(a,b,method="spearman", use="complete.obs")
> [1] 0.8164966
> > cor(a,b,method="spearman", use="pairwise.complete.obs")
> [1] 0.7777778
> My understanding is that, when the inputs are vectors (but not
> necessarily when they're matrices), the "complete.obs" and
> "pairwise.complete.obs" arguments should give identical spearman
> correlations. The above example clearly shows they do not in my version
> of R (2.11.1). However, in cor.test, they do:
> > cor.test(a,b,method="spearman", use="complete.obs")
> Spearman's rank correlation rho
> data: a and b
> S = 2.2222, p-value = 0.2222
> alternative hypothesis: true rho is not equal to 0
> sample estimates:
> So cor and cor.test do not agree, which seems very likely to be a bug.
> When calculating by hand, I also get 0.7777778. Additionally, when
> using an old version of R (2.5.0), both the complete.obs and
> pairwise.complete.obs versions give 0.7777778. Which strongly suggests
> either 2.5.0 or 2.11.1 has a bug in it. Is this a bug? If so, has it
> already been reported? (I found a related but confusing email thread
> from 2004 in the R archives, but I did not find the resolution to that
> bug report).
> Additional info:
> Platform = Windows XP
> R version 2.11.1 (2010-05-31)
>  LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
> States.1252 LC_MONETARY=English_United States.1252
>  LC_NUMERIC=C LC_TIME=English_United
> attached base packages:
>  stats graphics grDevices utils datasets methods base
> loaded via a namespace (and not attached):
>  tools_2.11.1
>  "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"