[Rd] Incorrect handling of NA's in cor() (PR#6750)

Marek Ancukiewicz msa at biostat.mgh.harvard.edu
Fri Apr 9 20:08:17 CEST 2004


Dear Thomas,

The question becomes: how do we rank missing values?  In
version 1.8.1 at least, cor() uses rank()'s default handling
of missing values [the na.last parameter], that is, missing
values are assigned the highest rank. However, if nothing is
known about the meaning of NA, what would be the basis of
such an assumption?  Assigning the NAs the highest, the
lowest, or any other rank requires some additional
information.
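
As a small illustration (the vector is made up, and this shows
rank() called on its own, not cor()'s internals), the na.last
settings treat a single NA as follows:

```r
# Made-up vector with one missing value
x <- c(10, NA, 30, 20)

rank(x)                    # default na.last = TRUE: NA gets the highest rank -> 1 4 3 2
rank(x, na.last = FALSE)   # NA gets the lowest rank                         -> 2 1 4 3
rank(x, na.last = "keep")  # NA keeps a missing rank                         -> 1 NA 3 2
rank(x, na.last = NA)      # NA is removed before ranking                    -> 1 3 2
```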

It seems that the default handling of missing values should be
to assign them missing ranks: within cor(), rank() should be
called with na.last="keep". However, cor() could have an
additional parameter, such as na.rank, which would allow a
known ranking of missing values to be taken into account, and
which would be passed to rank().
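
A rough sketch of why the choice matters, computing a Spearman
correlation by hand as the Pearson correlation of ranks. The
data are made up, and this does not claim to reproduce what
cor() does internally in any particular version:

```r
# Made-up data: x has one NA, paired with a mid-range value of y
x <- c(1, 2, 3, 4, NA)
y <- c(1, 2, 4, 5, 3)

# (a) NA ranked last: it becomes rank 5 and enters the correlation
r_last <- cor(rank(x), rank(y))

# (b) NA keeps a missing rank, so the pair is dropped after ranking
r_keep <- cor(rank(x, na.last = "keep"), rank(y, na.last = "keep"),
              use = "complete.obs")

# (c) manual approach: drop incomplete pairs first, then rank
ok <- complete.cases(x, y)
r_manual <- cor(rank(x[ok]), rank(y[ok]))

c(last = r_last, keep = r_keep, manual = r_manual)
```

With these numbers the three results differ (0.7, about 0.99,
and 1), so both the rank given to NA and the order of ranking
and dropping affect the answer.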

By the way, if this were possible [and probably it isn't,
because of compatibility with Splus], I would rename the
"na.last" parameter of rank() to "na.rank", with values
such as "last", "first", "remove", and "na". That would seem
easier to remember. Also, perhaps the default value should be
"na".
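
For concreteness, the naming could be tried out with a thin
wrapper. This is only a sketch: rank_na() and its na.rank
argument are hypothetical names, not part of R, and the
wrapper simply maps the proposed values onto the existing
na.last settings:

```r
# Hypothetical wrapper around rank(); not an existing R function
rank_na <- function(x, na.rank = c("na", "last", "first", "remove")) {
  na.rank <- match.arg(na.rank)      # default is the first choice, "na"
  na.last <- switch(na.rank,
                    last   = TRUE,   # NAs get the highest ranks
                    first  = FALSE,  # NAs get the lowest ranks
                    remove = NA,     # NAs are dropped
                    na     = "keep") # NAs get missing ranks
  rank(x, na.last = na.last)
}

rank_na(c(10, NA, 30, 20))          # 1 NA 3 2
rank_na(c(10, NA, 30, 20), "last")  # 1 4 3 2
```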

Regards,

Marek


> X-Original-To: msa at biostat.mgh.harvard.edu
> Date: Fri, 9 Apr 2004 10:42:59 -0700 (PDT)
> From: Thomas Lumley <tlumley at u.washington.edu>
> Cc: r-devel at stat.math.ethz.ch, R-bugs at biostat.ku.dk
> 
> On Fri, 9 Apr 2004 msa at biostat.mgh.harvard.edu wrote:
> 
> >
> > Dear Uwe,
> >
> > You are wrong. First, I've read the help file before
> > submitting the report. For two variables,
> > use="pairwise.complete.obs" and use="complete.obs" should be
> > equivalent, shouldn't it? Of course, the results will be
> > different when we have more than 2 variables. Second, with the
> > call you proposed I am also getting an incorrect result:
> >
> 
> I think it's more complicated than either of you are considering.
> 
> For the Pearson correlation everything is straightforward, and
> pairwise.complete is the same as complete, which is the same as dropping
> the NAs manually.
> 
> For the rank correlations the question is when the ranking should be done.
> The cor() function ranks the observations and then drops missing values,
> whereas the manual approach drops missing values and then ranks.
> 
> I'm not convinced that it is obvious which of these is right, though
> certainly the help page should document whichever is being done.
> 
> 
> 	-thomas
>


