[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

Thu Feb 22 12:31:40 CET 2018

>>>>> Gabriel Becker <gmbecker at ucdavis.edu>
>>>>>     on Wed, 21 Feb 2018 07:11:44 -0800 writes:

    > Hi all,
    > For the record this approach isnt 100% backwards compatible, because
    > names(mergeddf) will e incompatibly different. Thatx why i claimed
    > bakcwards compatable-ish

exactly.

    > That said its still worth considering imho because of the reasons stated
    > (and honestly one particular simple reading of the docs might suggest that
    > this was thr intended behavior all along). Im not a member of Rcore through
    > so i cant do said considering myself.

I agree with Scott, Frederik and you that this changes seems
worth considering.
As Duncan Murdoch has mentioned, this alone may not be
sufficient.

In addition to your proposed patch (which I have simplified, not
using intersection() but working with underlying  match()
directly), it is little work to introduce an extra argument, I'm
calling  'no.dups = TRUE'  which when set to false would mirror
current R's behavior... and documenting it, then also documents the
new behavior (to some extent).

My plan is to commit it soonish ;-)
Martin

    > Best,
    > ~G

    > On Feb 20, 2018 7:15 PM, <frederik at ofb.net> wrote:

    > Hi Scott,

    > I tried the new patch and can confirm that it has the advertised
    > behavior on a couple of test cases. I think it makes sense to apply
    > it, because any existing code which refers to a second duplicate
    > data.frame column by name is already broken, while if the reference is
    > by numerical index then changing the column name shouldn't break it.

    > I don't know if you need to update the documentation as part of your
    > patch, or if whoever applies it would be happy to do that. Somebody
    > from R core want to weigh in on this?

    > I attach a file with the test example from your original email as well
    > as a second test case I added with two "by" columns.

    > Thanks,

    > Frederick

    > On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:
    >> Hi Frederick,
    >> 
    >> It looks like I didn't overwrite the patch.diff file after the last edits.
    >> Here's the correct patch (attached and copied below):
    >> 
    >> Index: src/library/base/R/merge.R
    >> ===================================================================
    >> --- src/library/base/R/merge.R (revision 74280)
    >> +++ src/library/base/R/merge.R (working copy)
    >> @@ -157,6 +157,14 @@
    >> }
    >> 
    >> if(has.common.nms) names(y) <- nm.y
    >> +        ## If by.x %in% names(y) then duplicate column names still arise,
    >> +        ## apply suffixes to just y - this keeps backwards compatibility
    >> +        ## when referring to by.x in the resulting data.frame
    >> +        dupe.keyx <- intersect(nm.by, names(y))
    >> +        if(length(dupe.keyx)) {
    >> +          if(nzchar(suffixes[2L]))
    >> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
    >> suffixes[2L], sep="")
    >> +        }
    >> nm <- c(names(x), names(y))
    >> if(any(d <- duplicated(nm)))
    >> if(sum(d) > 1L)
    >> 
    >> Best,
    >> 
    >> Scott
    >> 
    >> On 21 February 2018 at 08:23, <frederik at ofb.net> wrote:
    >> 
    >> > Hi Scott,
    >> >
    >> > I think that's a good idea and I tried your patch on my copy of the
    >> > repository. But it looks to me like the recent patch is identical to
    >> > the previous one, can you confirm this?
    >> >
    >> > Frederick
    >> >
    >> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
    >> > > Thanks Gabriel,
    >> > >
    >> > > I think your suggested approach is 100% backwards compatible
    >> > >
    >> > > Currently in the case of duplicate column names only the first can be
    >> > > indexed by its name. This will always be the column appearing in by.x,
    >> > > meaning the column in y with the same name cannot be accessed.
    > Appending
    >> > > ".y" (suffixes[2L]) to this column means it can now be accessed, while
    >> > > keeping the current behaviour of making the key columns always
    > accessible
    >> > > by using the names provided to by.x.
    >> > >
    >> > > I've attached a new patch that has this behaviour.
    >> > >
    >> > > Best,
    >> > >
    >> > > Scott
    >> > >
    >> > > On 19 February 2018 at 05:08, Gabriel Becker <gmbecker at ucdavis.edu>
    >> > wrote:
    >> > >
    >> > > > It seems like there is a way that is backwards compatible-ish in the
    >> > sense
    >> > > > mentioned and still has the (arguably, but a good argument I think)
    >> > better
    >> > > > behavior:
    >> > > >
    >> > > > if by.x is 'name', (AND by.y is not also 'name'), then x's 'name'
    >> > column
    >> > > > is called name and y's 'name' column (not used int he merge) is
    >> > changed to
    >> > > > name.y.
    >> > > >
    >> > > > Now of course this would still change output, but it would change
    > it to
    >> > > > something I think would be better, while retaining the 'merge
    > columns
    >> > > > retain their exact names' mechanic as documented.
    >> > > >
    >> > > > ~G
    >> > > >
    >> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <
    > s.ritchie73 at gmail.com>
    >> > > > wrote:
    >> > > >
    >> > > >> Thanks Duncan and Frederick,
    >> > > >>
    >> > > >> I suspected as much - there doesn't appear to be any reason why
    >> > conflicts
    >> > > >> between by.x and names(y) shouldn't and cannot be checked, but I
    > can
    >> > see
    >> > > >> how this might be more trouble than its worth given it potentially
    > may
    >> > > >> break downstream packages (i.e. any cases where this occurs but
    > they
    >> > > >> expect
    >> > > >> the name of the key column(s) to remain the same).
    >> > > >>
    >> > > >> Best,
    >> > > >>
    >> > > >> Scott
    >> > > >>
    >> > > >> On 18 February 2018 at 11:48, Duncan Murdoch <
    >> > murdoch.duncan at gmail.com>
    >> > > >> wrote:
    >> > > >>
    >> > > >> > On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
    >> > > >> >
    >> > > >> >> Hi Scott,
    >> > > >> >>
    >> > > >> >> Thanks for the patch. I'm not really involved in R development;
    > it
    >> > > >> >> will be up to someone in the R core team to apply it. I would
    >> > hazard
    >> > > >> >> to say that even if correct (I haven't checked), it will not be
    >> > > >> >> applied because the change might break existing code. For
    > example
    >> > it
    >> > > >> >> seems like reasonable code might easily assume that a column
    > with
    >> > the
    >> > > >> >> same name as "by.x" exists in the output of 'merge'. That's
    > just my
    >> > > >> >> best guess... I don't participate on here often.
    >> > > >> >>
    >> > > >> >
    >> > > >> >
    >> > > >> > I think you're right.  If I were still a member of R Core, I
    > would
    >> > want
    >> > > >> to
    >> > > >> > test this against all packages on CRAN and Bioconductor, and
    > since
    >> > that
    >> > > >> > test takes a couple of days to run on my laptop, I'd probably
    > never
    >> > get
    >> > > >> > around to it.
    >> > > >> >
    >> > > >> > There are lots of cases where "I would have done that
    > differently",
    >> > but
    >> > > >> > most of them are far too much trouble to change now that R is
    > more
    >> > than
    >> > > >> 20
    >> > > >> > years old.  And in many cases it will turn out that the way R
    > does
    >> > it
    >> > > >> > actually does make more sense than the way I would have done it.
    >> > > >> >
    >> > > >> > Duncan Murdoch
    >> > > >> >
    >> > > >> >
    >> > > >> >
    >> > > >> >> Cheers,
    >> > > >> >>
    >> > > >> >> Frederick
    >> > > >> >>
    >> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
    >> > > >> >>
    >> > > >> >>> The attached patch.diff will make merge.data.frame() append the
    >> > > >> suffixes
    >> > > >> >>> to
    >> > > >> >>> columns with common names between by.x and names(y).
    >> > > >> >>>
    >> > > >> >>> Best,
    >> > > >> >>>
    >> > > >> >>> Scott Ritchie
    >> > > >> >>>
    >> > > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
    >> > s.ritchie73 at gmail.com>
    >> > > >> >>> wrote:
    >> > > >> >>>
    >> > > >> >>> Hi Frederick,
    >> > > >> >>>>
    >> > > >> >>>> I would expect that any duplicate names in the resulting
    >> > data.frame
    >> > > >> >>>> would
    >> > > >> >>>> have the suffixes appended to them, regardless of whether or
    > not
    >> > they
    >> > > >> >>>> are
    >> > > >> >>>> used as the join key. So in my example I would expect
    > "names.x"
    >> > and
    >> > > >> >>>> "names.y" to indicate their source data.frame.
    >> > > >> >>>>
    >> > > >> >>>> While careful reading of the documentation reveals this is not
    >> > the
    >> > > >> >>>> case, I
    >> > > >> >>>> would argue the intent of the suffixes functionality should
    >> > equally
    >> > > >> be
    >> > > >> >>>> applied to this type of case.
    >> > > >> >>>>
    >> > > >> >>>> If you agree this would be useful, I'm happy to write a patch
    > for
    >> > > >> >>>> merge.data.frame that will add suffixes in this case - I
    > intend
    >> > to do
    >> > > >> >>>> the
    >> > > >> >>>> same for merge.data.table in the data.table package where I
    >> > initially
    >> > > >> >>>> encountered the edge case.
    >> > > >> >>>>
    >> > > >> >>>> Best,
    >> > > >> >>>>
    >> > > >> >>>> Scott
    >> > > >> >>>>
    >> > > >> >>>> On 17 February 2018 at 03:53, <frederik at ofb.net> wrote:
    >> > > >> >>>>
    >> > > >> >>>> Hi Scott,
    >> > > >> >>>>>
    >> > > >> >>>>> It seems like reasonable behavior to me. What result would
    > you
    >> > > >> expect?
    >> > > >> >>>>> That the second "name" should be called "name.y"?
    >> > > >> >>>>>
    >> > > >> >>>>> The "merge" documentation says:
    >> > > >> >>>>>
    >> > > >> >>>>>      If the columns in the data frames not used in merging
    > have
    >> > any
    >> > > >> >>>>>      common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’
    > by
    >> > > >> default)
    >> > > >> >>>>>      appended to try to make the names of the result unique.
    >> > > >> >>>>>
    >> > > >> >>>>> Since the first "name" column was used in merging, leaving
    > both
    >> > > >> >>>>> without a suffix seems consistent with the documentation...
    >> > > >> >>>>>
    >> > > >> >>>>> Frederick
    >> > > >> >>>>>
    >> > > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie
    > wrote:
    >> > > >> >>>>>
    >> > > >> >>>>>> Hi,
    >> > > >> >>>>>>
    >> > > >> >>>>>> I was unable to find a bug report for this with a cursory
    >> > search,
    >> > > >> but
    >> > > >> >>>>>>
    >> > > >> >>>>> would
    >> > > >> >>>>>
    >> > > >> >>>>>> like clarification if this is intended or unavoidable
    >> > behaviour:
    >> > > >> >>>>>>
    >> > > >> >>>>>> ```{r}
    >> > > >> >>>>>> # Create example data.frames
    >> > > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
    >> > > >> >>>>>>                        sex=c("F", "M", "F", "M"),
    >> > > >> >>>>>>                        age=c(41, 43, 36, 51))
    >> > > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
    >> > > >> >>>>>>                         name=c("Oliver", "Sebastian",
    >> > "Kai-lee"),
    >> > > >> >>>>>>                         sex=c("M", "M", "F"),
    >> > > >> >>>>>>                         age=c(5,8,7))
    >> > > >> >>>>>>
    >> > > >> >>>>>> # Merge() creates a duplicated "name" column:
    >> > > >> >>>>>> merge(parents, children, by.x = "name", by.y = "parent")
    >> > > >> >>>>>> ```
    >> > > >> >>>>>>
    >> > > >> >>>>>> Output:
    >> > > >> >>>>>> ```
    >> > > >> >>>>>>     name sex.x age.x      name sex.y age.y
    >> > > >> >>>>>> 1   Max     M    43 Sebastian     M     8
    >> > > >> >>>>>> 2   Qin     F    36   Kai-lee     F     7
    >> > > >> >>>>>> 3 Sarah     F    41    Oliver     M     5
    >> > > >> >>>>>> Warning message:
    >> > > >> >>>>>> In merge.data.frame(parents, children, by.x = "name", by.y =
    >> > > >> >>>>>> "parent") :
    >> > > >> >>>>>>    column name ‘name’ is duplicated in the result
    >> > > >> >>>>>> ```
    >> > > >> >>>>>>
    >> > > >> >>>>>> Kind Regards,
    >> > > >> >>>>>>
    >> > > >> >>>>>> Scott Ritchie
    >> > > >> >>>>>>
    >> > > >> >>>>>>        [[alternative HTML version deleted]]
    >> > > >> >>>>>>
    >> > > >> >>>>>> ______________________________________________
    >> > > >> >>>>>> R-devel at r-project.org mailing list
    >> > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> > > >> >>>>>>
    >> > > >> >>>>>>
    >> > > >> >>>>>
    >> > > >> >>>>
    >> > > >> >>>>
    >> > > >> >> Index: src/library/base/R/merge.R
    >> > > >> >>> ============================================================
    >> > =======
    >> > > >> >>> --- src/library/base/R/merge.R  (revision 74264)
    >> > > >> >>> +++ src/library/base/R/merge.R  (working copy)
    >> > > >> >>> @@ -157,6 +157,15 @@
    >> > > >> >>>           }
    >> > > >> >>>             if(has.common.nms) names(y) <- nm.y
    >> > > >> >>> +        ## If by.x %in% names(y) then duplicate column names
    >> > still
    >> > > >> >>> arise,
    >> > > >> >>> +        ## apply suffixes to these
    >> > > >> >>> +        dupe.keyx <- intersect(nm.by, names(y))
    >> > > >> >>> +        if(length(dupe.keyx)) {
    >> > > >> >>> +          if(nzchar(suffixes[1L]))
    >> > > >> >>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
    >> > > >> >>> paste(dupe.keyx, suffixes[1L], sep="")
    >> > > >> >>> +          if(nzchar(suffixes[2L]))
    >> > > >> >>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> > > >> >>> paste(dupe.keyx, suffixes[2L], sep="")
    >> > > >> >>> +        }
    >> > > >> >>>           nm <- c(names(x), names(y))
    >> > > >> >>>           if(any(d <- duplicated(nm)))
    >> > > >> >>>               if(sum(d) > 1L)
    >> > > >> >>>
    >> > > >> >>
    >> > > >> >> ______________________________________________
    >> > > >> >> R-devel at r-project.org mailing list
    >> > > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> > > >> >>
    >> > > >> >>
    >> > > >> >
    >> > > >>
    >> > > >>         [[alternative HTML version deleted]]
    >> > > >>
    >> > > >> ______________________________________________
    >> > > >> R-devel at r-project.org mailing list
    >> > > >> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> > > >>
    >> > > >
    >> > > >
    >> > > >
    >> > > > --
    >> > > > Gabriel Becker, PhD
    >> > > > Scientist (Bioinformatics)
    >> > > > Genentech Research
    >> > > >
    >> >
    >> > > Index: src/library/base/R/merge.R
    >> > > ===================================================================
    >> > > --- src/library/base/R/merge.R        (revision 74264)
    >> > > +++ src/library/base/R/merge.R        (working copy)
    >> > > @@ -157,6 +157,15 @@
    >> > >          }
    >> > >
    >> > >          if(has.common.nms) names(y) <- nm.y
    >> > > +        ## If by.x %in% names(y) then duplicate column names still
    >> > arise,
    >> > > +        ## apply suffixes to these
    >> > > +        dupe.keyx <- intersect(nm.by, names(y))
    >> > > +        if(length(dupe.keyx)) {
    >> > > +          if(nzchar(suffixes[1L]))
    >> > > +            names(x)[match(dupe.keyx, names(x), 0L)] <-
    >> > paste(dupe.keyx, suffixes[1L], sep="")
    >> > > +          if(nzchar(suffixes[2L]))
    >> > > +            names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> > paste(dupe.keyx, suffixes[2L], sep="")
    >> > > +        }
    >> > >          nm <- c(names(x), names(y))
    >> > >          if(any(d <- duplicated(nm)))
    >> > >              if(sum(d) > 1L)
    >> >
    >> > > ______________________________________________
    >> > > R-devel at r-project.org mailing list
    >> > > https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >
    >> >

    >> Index: src/library/base/R/merge.R
    >> ===================================================================
    >> --- src/library/base/R/merge.R        (revision 74280)
    >> +++ src/library/base/R/merge.R        (working copy)
    >> @@ -157,6 +157,14 @@
    >> }
    >> 
    >> if(has.common.nms) names(y) <- nm.y
    >> +        ## If by.x %in% names(y) then duplicate column names still arise,
    >> +        ## apply suffixes to just y - this keeps backwards compatibility
    >> +        ## when referring to by.x in the resulting data.frame
    >> +        dupe.keyx <- intersect(nm.by, names(y))
    >> +        if(length(dupe.keyx)) {
    >> +          if(nzchar(suffixes[2L]))
    >> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
    > suffixes[2L], sep="")
    >> +        }
    >> nm <- c(names(x), names(y))
    >> if(any(d <- duplicated(nm)))
    >> if(sum(d) > 1L)

    > [[alternative HTML version deleted]]

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel