[Rd] Incorrect handling of NA's in cor() (PR#6750)

Fri Apr 9 21:50:14 CEST 2004

> X-Original-To: msa at biostat.mgh.harvard.edu
> Date: Fri, 9 Apr 2004 11:21:47 -0700 (PDT)
> From: Thomas Lumley <tlumley at u.washington.edu>
> Cc: R-bugs at biostat.ku.dk
> 
> On Fri, 9 Apr 2004, Marek Ancukiewicz wrote:
> 
> >
> > Dear Thomas,
> >
> > The question becomes: how do we rank missing values?
> 
> That's one of the questions.  It's not the only question.  Suppose x has
> no missing values but y has a missing value.  Should the ranks for x be
> based on the whole vector or just on the values where y isn't missing?
> 
> 	-thomas

I see what you mean. 

One could give an argument in favour of each of these
approaches. If we treat the data primarily as pairs of values
(or, more generally, cases), then we should discard incomplete
pairs (records) first and rank afterwards. If we consider x and
y primarily as separate from each other (especially with regard
to how the missing values arise), then a more natural approach
would be to rank before dropping incomplete pairs. The latter
approach uses more of the information in the data; the former
ignores information that might be spurious, especially when
missing y values tend to coincide with high (or low) x values.
Dropping NAs first and ranking later seems to be the
conservative approach; with the other approach one should
probably always check whether NAs in one variable are
correlated with other variables.

My understanding is that cor() in 1.9.0 will do the ranking
independently, before dropping missing pairs/cases. It would
be good to have this documented in help(); it might also be
good to add a warning about the perils of analysis with missing
values when occurrences of NAs in one variable are correlated
with other variables.
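
As a rough illustration of the kind of check I have in mind, the sketch
below compares the x values of cases where y is observed with those of
cases where y is missing. It is in Python only to keep it
self-contained; the data and the helper name are hypothetical, missing
values are represented by None, and a comparison of means is just a
crude diagnostic, not a formal test.

```python
from statistics import mean

def split_by_missing(x, y):
    """Split the non-missing x values by whether the paired y is missing."""
    seen = [u for u, v in zip(x, y) if u is not None and v is not None]
    lost = [u for u, v in zip(x, y) if u is not None and v is None]
    return seen, lost

# made-up data in which y is missing exactly where x is largest
x = [1, 2, 3, 4, 10, 20, 30]
y = [2, 1, 3, 4, 5, None, None]

seen, lost = split_by_missing(x, y)
print(mean(seen), mean(lost))  # 4 25
```

A large difference between the two groups suggests that the missing y
values are not unrelated to x, so the choice between the two orderings
may matter.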

Marek