[R] bug? in stats::cor for use=complete.obs with NAs

Peter Ehlers ehlers at ucalgary.ca
Fri Jun 11 01:02:36 CEST 2010


I don't think that this would be considered a bug. The
reason for the discrepancy between use="complete.obs"
and use="pairwise.complete.obs" for the case of the
Spearman correlation of two vectors x, y is this:

"pairwise" first drops the incomplete pairs via
complete.cases(x,y) and then ranks what is left;
this is also what's done in cor.test().

"complete" ranks first (keeping NAs via the
na.last="keep" argument to rank()) and then does
complete.cases(ranked.x,ranked.y) on the ranked data.
Because each vector is ranked in full, a value whose
partner is NA still affects the ranks (and ties) within
its own vector, so the ranks being correlated can
differ from those for "pairwise".
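
Here is a minimal sketch of the two orderings, done by hand
on Hugh's vectors from the quoted message below. It only
mimics the behaviour described above using rank() and
complete.cases(); it does not call cor()'s internal code,
and the variable names are my own.

```r
a <- c(1, 3, NA, 1, 2)
b <- c(1, 2, 1, 1, 4)

# "pairwise" order: drop incomplete cases first, then rank
ok <- complete.cases(a, b)
rho_delete_then_rank <- cor(rank(a[ok]), rank(b[ok]))

# "complete" order as described above: rank each full vector
# (NAs kept as NA), then drop cases with an NA rank
ra <- rank(a, na.last = "keep")
rb <- rank(b, na.last = "keep")
ok_ranked <- complete.cases(ra, rb)
rho_rank_then_delete <- cor(ra[ok_ranked], rb[ok_ranked])

print(rho_delete_then_rank)   # 0.7777778
print(rho_rank_then_delete)   # 0.8164966
```

These are the two values reported for 2.11.1, so the
difference seems to be purely a matter of the order of
ranking and deletion.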

I must admit that I wasn't aware that R does this
and I don't know the rationale for it. The help page
says:

    If use is "complete.obs" then missing values are
    handled by casewise deletion ...

which is not clear on the order of ranking and
deletion, but further down the page:

    Note that "spearman" basically computes cor(R(x), R(y))
    (or cov(.,.)) where R(u) := rank(u, na.last="keep").
    In the case of missing values, the ranks are calculated
    depending on the value of use, either based on complete
    observations, or based on pairwise completeness with
    reranking for each pair.

I guess that this implies that, for "complete", the ranking
occurs before the casewise deletion (else why the
na.last="keep"?).
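
To see what na.last="keep" does to the ranks themselves,
here is a small sketch on the vectors from the quoted
example (the object names are just for illustration). The
third element of b is tied with two others, so ranking b
in full gives different ranks than ranking it after the
case with the NA in a has been dropped.

```r
a <- c(1, 3, NA, 1, 2)
b <- c(1, 2, 1, 1, 4)

ranks_a_keep    <- rank(a, na.last = "keep")  # NA stays NA
ranks_b_full    <- rank(b)                    # all five values ranked
ranks_b_reduced <- rank(b[!is.na(a)])         # ranked after deletion

print(ranks_a_keep)     # 1.5 4.0  NA 1.5 3.0
print(ranks_b_full)     # 2 4 2 2 5
print(ranks_b_reduced)  # 1.5 3.0 1.5 4.0
```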

If anyone knows the rationale and/or can give a reference,
I'd be glad to hear it.

   -Peter Ehlers


On 2010-06-09 11:36, hugh.genin at thomsonreuters.com wrote:
> Arrrrr,
>
> I think I've found a bug in the behavior of the stats::cor function when
> NAs are present, but in case I'm missing something, could you look over
> this example and let me know what you think:
>
>
>> a = c(1,3,NA,1,2)
>> b = c(1,2,1,1,4)
>> cor(a,b,method="spearman", use="complete.obs")
> [1] 0.8164966
>> cor(a,b,method="spearman", use="pairwise.complete.obs")
> [1] 0.7777778
>
> My understanding is that, when the inputs are vectors (but not
> necessarily when they're matrices), the "complete.obs" and
> "pairwise.complete.obs" arguments should give identical spearman
> correlations.  The above example clearly shows they do not in my version
> of R (2.11.1).  However, in cor.test, they do:
>
>
>> cor.test(a,b,method="spearman", use="complete.obs")
>
>          Spearman's rank correlation rho
>
> data:  a and b
> S = 2.2222, p-value = 0.2222
> alternative hypothesis: true rho is not equal to 0
> sample estimates:
>        rho
> 0.7777778
>
>
> So cor and cor.test do not agree, which seems very likely to be a bug.
> When calculating by hand, I also get 0.7777778.  Additionally, when
> using an old version of R (2.5.0), both the complete.obs and
> pairwise.complete.obs versions give 0.7777778.  Which strongly suggests
> either 2.5.0 or 2.11.1 has a bug in it.  Is this a bug?  If so, has it
> already been reported?  (I found a related but confusing email thread
> from 2004 in the R archives, but I did not find the resolution to that
> bug report).
>
>
> Additional info:
> Platform = Windows XP
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> i386-pc-mingw32
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
> States.1252    LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C                           LC_TIME=English_United
> States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
>
> loaded via a namespace (and not attached):
> [1] tools_2.11.1
>> Sys.getlocale()
> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>
> Thanks,
>
> --Hugh
>


