[Rd] 'merge' function: behavior w.r.t. NAs in the key column

Wed Mar 19 18:01:27 CET 2008

On Wed, 19 Mar 2008, Prof Brian Ripley wrote:

> merge() in R behaves in exactly the same way as match() (which it uses),
> and not as Mr Anders claims is 'the convention'.  (Note that == and match
> are not the same concept, nor is 'identical'.)

Splus's merge works more like x.key==y.key (or match(incomparables=NA)):
rows with NA's in the key columns are not included in the output.
I am not familiar with how 'conventional' databases deal with missing
entries in join keys.  Should we each add an argument to tell how
to deal with such rows?

This was discussed at length for match() some years ago (AFAIR when NA
character strings were formalized), and documented.  Hopefully merge() in
S-PLUS does the same as its match().

   RS> x<-data.frame(k1=c(NA,NA,3,4,5),k2=c(1,NA,NA,4,5),data=1:5)
   RS> y<-data.frame(k1=c(NA,2,NA,4,5),k2=c(NA,NA,3,4,5),data=1:5)
   RS> merge(x,y,by=c("k1","k2"))
   Splus:   k1 k2 data.x data.y
   Splus: 1  4  4      4      4
   Splus: 2  5  5      5      5
   R    :   k1 k2 data.x data.y
   R    : 1  4  4      4      4
   R    : 2  5  5      5      5
   R    : 3 NA NA      2      1
   RS> merge(x,y,by=c("k1"))
   Splus:   k1 k2.x data.x k2.y data.y
   Splus: 1  4    4      4    4      4
   Splus: 2  5    5      5    5      5
   R    :   k1 k2.x data.x k2.y data.y
   R    : 1  4    4      4    4      4
   R    : 2  5    5      5    5      5
   R    : 3 NA    1      1   NA      1
   R    : 4 NA    1      1    3      3
   R    : 5 NA   NA      2   NA      1
   R    : 6 NA   NA      2    3      3
   RS> merge(x,y,by=c("k2"))
   Splus:   k2 k1.x data.x k1.y data.y
   Splus: 1  4    4      4    4      4
   Splus: 2  5    5      5    5      5
   R    :   k2 k1.x data.x k1.y data.y
   R    : 1  4    4      4    4      4
   R    : 2  5    5      5    5      5
   R    : 3 NA   NA      2   NA      1
   R    : 4 NA   NA      2    2      2
   R    : 5 NA    3      3   NA      1
   R    : 6 NA    3      3    2      2

> I'll add an explicit link to help those who don't believe that 'match' is
> a documented concept.
>
>
> On Fri, 14 Mar 2008, Bill Dunlap wrote:
>
> > On Fri, 14 Mar 2008, Simon Anders wrote:
> >
> >> I recently ran into a problem with 'merge' that stems from the way how
> >> missing values in the key column (i.e., the column specified
> >> in the "by" argument) are handled. I wonder whether the current behavior
> >> is fully consistent.
> >> ...
> >>> x <- data.frame( key = c(1:3,3,NA,NA), val = 10+1:6 )
> >>> y <- data.frame( key = c(NA,2:5,3,NA), val = 20+1:7 )
> >> ...
> >>> merge( x, y, by="key" )
> >>    key val.x val.y
> >> 1   2    12    22
> >> 2   3    13    23
> >> 3   3    13    26
> >> 4   3    14    23
> >> 5   3    14    26
> >> 6  NA    15    21
> >> 7  NA    15    27
> >> 8  NA    16    21
> >> 9  NA    16    27
> >>
> >> As one should expect, there are now four lines with key value '3',
> >> because the key '3' appears twice both in x and in y. According to the
> >> logic of merge, a row should be produced in the output for each pairing
> >> of a row from x and a row from y where the values of 'key' are equal.
> >>
> >> However, the 'NA' values are treated exactly the same way. It seems that
> >> 'merge' considers the pairing of lines with 'NA' in both 'key' columns
> >> an allowed match. IMHO, this runs against the convention that two NAs
> >> are not considered equal. ('NA==NA' does not evaluate to 'TRUE'.)
> >>
> >> Is might be more consistent if merge did not include any rows into the
> >> output with an "NA" in the key column.
> >>
> >> Maybe, one could add a flag argument to 'merge' to switch between this
> >> behaviour and the current one? A note in the help page might be nice, too.
> >
> > Splus (versions 8.0, 7.0, and 6.2) gives:
> >   > merge( x, y, by="key" )
> >     key val.x val.y
> >   1   2    12    22
> >   2   3    13    23
> >   3   3    14    23
> >   4   3    13    26
> >   5   3    14    26
> > Is that what you expect?  There is no argument
> > to Splus's merge to make it include the NA's
> > in the way R's merge does.  Should there be such
> > an argument?
> >
> > ----------------------------------------------------------------------------
> > Bill Dunlap
> > Insightful Corporation
> > bill at insightful dot com
> >
> > "All statements in this message represent the opinions of the author and do
> > not necessarily reflect Insightful Corporation policy or position."
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>

----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do
 not necessarily reflect Insightful Corporation policy or position."