[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

Fri Feb 23 15:03:09 CET 2018

>>>>> Scott Ritchie <s.ritchie73 at gmail.com>
>>>>>     on Fri, 23 Feb 2018 12:32:41 +1100 writes:

    > Thanks Martin!
    > Can you clarify the functionality of the 'no.dups' argument so I can change
    > my patch to `data.table:::merge.data.table` accordingly?

    > - When `no.dups=TRUE` will the suffix to the by.x column name? Or will it
    > take the functionality of the second functionality where only the column in
    > y has the suffix added?
    > - When `no.dups=FALSE` will the output be the same as it currently (no
    > suffix added to either column)? Or will add the suffix to the column in y?

I had started from your patch... and worked from there.
So, there's no need (and use) to provide another one.

I also needed to update the man page, add a regression test, add
an entry to NEWS.Rd ...

Just wait until I commit..
Martin

    > Best,

    > Scott

    > On 22 February 2018 at 22:31, Martin Maechler <maechler at stat.math.ethz.ch>
    > wrote:

    >> >>>>> Gabriel Becker <gmbecker at ucdavis.edu>
    >> >>>>>     on Wed, 21 Feb 2018 07:11:44 -0800 writes:
    >> 
    >> > Hi all,
    >> > For the record this approach isnt 100% backwards compatible, because
    >> > names(mergeddf) will e incompatibly different. Thatx why i claimed
    >> > bakcwards compatable-ish
    >> 
    >> exactly.
    >> 
    >> > That said its still worth considering imho because of the reasons
    >> stated
    >> > (and honestly one particular simple reading of the docs might
    >> suggest that
    >> > this was thr intended behavior all along). Im not a member of Rcore
    >> through
    >> > so i cant do said considering myself.
    >> 
    >> I agree with Scott, Frederik and you that this changes seems
    >> worth considering.
    >> As Duncan Murdoch has mentioned, this alone may not be
    >> sufficient.
    >> 
    >> In addition to your proposed patch (which I have simplified, not
    >> using intersection() but working with underlying  match()
    >> directly), it is little work to introduce an extra argument, I'm
    >> calling  'no.dups = TRUE'  which when set to false would mirror
    >> current R's behavior... and documenting it, then also documents the
    >> new behavior (to some extent).
    >> 
    >> My plan is to commit it soonish ;-)
    >> Martin
    >> 
    >> > Best,
    >> > ~G
    >> 
    >> > On Feb 20, 2018 7:15 PM, <frederik at ofb.net> wrote:
    >> 
    >> > Hi Scott,
    >> 
    >> > I tried the new patch and can confirm that it has the advertised
    >> > behavior on a couple of test cases. I think it makes sense to apply
    >> > it, because any existing code which refers to a second duplicate
    >> > data.frame column by name is already broken, while if the reference
    >> is
    >> > by numerical index then changing the column name shouldn't break it.
    >> 
    >> > I don't know if you need to update the documentation as part of your
    >> > patch, or if whoever applies it would be happy to do that. Somebody
    >> > from R core want to weigh in on this?
    >> 
    >> > I attach a file with the test example from your original email as
    >> well
    >> > as a second test case I added with two "by" columns.
    >> 
    >> > Thanks,
    >> 
    >> > Frederick
    >> 
    >> > On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:
    >> >> Hi Frederick,
    >> >>
    >> >> It looks like I didn't overwrite the patch.diff file after the last
    >> edits.
    >> >> Here's the correct patch (attached and copied below):
    >> >>
    >> >> Index: src/library/base/R/merge.R
    >> >> ===================================================================
    >> >> --- src/library/base/R/merge.R (revision 74280)
    >> >> +++ src/library/base/R/merge.R (working copy)
    >> >> @@ -157,6 +157,14 @@
    >> >> }
    >> >>
    >> >> if(has.common.nms) names(y) <- nm.y
    >> >> +        ## If by.x %in% names(y) then duplicate column names still
    >> arise,
    >> >> +        ## apply suffixes to just y - this keeps backwards
    >> compatibility
    >> >> +        ## when referring to by.x in the resulting data.frame
    >> >> +        dupe.keyx <- intersect(nm.by, names(y))
    >> >> +        if(length(dupe.keyx)) {
    >> >> +          if(nzchar(suffixes[2L]))
    >> >> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> paste(dupe.keyx,
    >> >> suffixes[2L], sep="")
    >> >> +        }
    >> >> nm <- c(names(x), names(y))
    >> >> if(any(d <- duplicated(nm)))
    >> >> if(sum(d) > 1L)
    >> >>
    >> >> Best,
    >> >>
    >> >> Scott
    >> >>
    >> >> On 21 February 2018 at 08:23, <frederik at ofb.net> wrote:
    >> >>
    >> >> > Hi Scott,
    >> >> >
    >> >> > I think that's a good idea and I tried your patch on my copy of
    >> the
    >> >> > repository. But it looks to me like the recent patch is identical
    >> to
    >> >> > the previous one, can you confirm this?
    >> >> >
    >> >> > Frederick
    >> >> >
    >> >> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
    >> >> > > Thanks Gabriel,
    >> >> > >
    >> >> > > I think your suggested approach is 100% backwards compatible
    >> >> > >
    >> >> > > Currently in the case of duplicate column names only the first
    >> can be
    >> >> > > indexed by its name. This will always be the column appearing
    >> in by.x,
    >> >> > > meaning the column in y with the same name cannot be accessed.
    >> > Appending
    >> >> > > ".y" (suffixes[2L]) to this column means it can now be
    >> accessed, while
    >> >> > > keeping the current behaviour of making the key columns always
    >> > accessible
    >> >> > > by using the names provided to by.x.
    >> >> > >
    >> >> > > I've attached a new patch that has this behaviour.
    >> >> > >
    >> >> > > Best,
    >> >> > >
    >> >> > > Scott
    >> >> > >
    >> >> > > On 19 February 2018 at 05:08, Gabriel Becker <
    >> gmbecker at ucdavis.edu>
    >> >> > wrote:
    >> >> > >
    >> >> > > > It seems like there is a way that is backwards compatible-ish
    >> in the
    >> >> > sense
    >> >> > > > mentioned and still has the (arguably, but a good argument I
    >> think)
    >> >> > better
    >> >> > > > behavior:
    >> >> > > >
    >> >> > > > if by.x is 'name', (AND by.y is not also 'name'), then x's
    >> 'name'
    >> >> > column
    >> >> > > > is called name and y's 'name' column (not used int he merge)
    >> is
    >> >> > changed to
    >> >> > > > name.y.
    >> >> > > >
    >> >> > > > Now of course this would still change output, but it would
    >> change
    >> > it to
    >> >> > > > something I think would be better, while retaining the 'merge
    >> > columns
    >> >> > > > retain their exact names' mechanic as documented.
    >> >> > > >
    >> >> > > > ~G
    >> >> > > >
    >> >> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <
    >> > s.ritchie73 at gmail.com>
    >> >> > > > wrote:
    >> >> > > >
    >> >> > > >> Thanks Duncan and Frederick,
    >> >> > > >>
    >> >> > > >> I suspected as much - there doesn't appear to be any reason
    >> why
    >> >> > conflicts
    >> >> > > >> between by.x and names(y) shouldn't and cannot be checked,
    >> but I
    >> > can
    >> >> > see
    >> >> > > >> how this might be more trouble than its worth given it
    >> potentially
    >> > may
    >> >> > > >> break downstream packages (i.e. any cases where this occurs
    >> but
    >> > they
    >> >> > > >> expect
    >> >> > > >> the name of the key column(s) to remain the same).
    >> >> > > >>
    >> >> > > >> Best,
    >> >> > > >>
    >> >> > > >> Scott
    >> >> > > >>
    >> >> > > >> On 18 February 2018 at 11:48, Duncan Murdoch <
    >> >> > murdoch.duncan at gmail.com>
    >> >> > > >> wrote:
    >> >> > > >>
    >> >> > > >> > On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
    >> >> > > >> >
    >> >> > > >> >> Hi Scott,
    >> >> > > >> >>
    >> >> > > >> >> Thanks for the patch. I'm not really involved in R
    >> development;
    >> > it
    >> >> > > >> >> will be up to someone in the R core team to apply it. I
    >> would
    >> >> > hazard
    >> >> > > >> >> to say that even if correct (I haven't checked), it will
    >> not be
    >> >> > > >> >> applied because the change might break existing code. For
    >> > example
    >> >> > it
    >> >> > > >> >> seems like reasonable code might easily assume that a
    >> column
    >> > with
    >> >> > the
    >> >> > > >> >> same name as "by.x" exists in the output of 'merge'.
    >> That's
    >> > just my
    >> >> > > >> >> best guess... I don't participate on here often.
    >> >> > > >> >>
    >> >> > > >> >
    >> >> > > >> >
    >> >> > > >> > I think you're right.  If I were still a member of R Core,
    >> I
    >> > would
    >> >> > want
    >> >> > > >> to
    >> >> > > >> > test this against all packages on CRAN and Bioconductor,
    >> and
    >> > since
    >> >> > that
    >> >> > > >> > test takes a couple of days to run on my laptop, I'd
    >> probably
    >> > never
    >> >> > get
    >> >> > > >> > around to it.
    >> >> > > >> >
    >> >> > > >> > There are lots of cases where "I would have done that
    >> > differently",
    >> >> > but
    >> >> > > >> > most of them are far too much trouble to change now that R
    >> is
    >> > more
    >> >> > than
    >> >> > > >> 20
    >> >> > > >> > years old.  And in many cases it will turn out that the
    >> way R
    >> > does
    >> >> > it
    >> >> > > >> > actually does make more sense than the way I would have
    >> done it.
    >> >> > > >> >
    >> >> > > >> > Duncan Murdoch
    >> >> > > >> >
    >> >> > > >> >
    >> >> > > >> >
    >> >> > > >> >> Cheers,
    >> >> > > >> >>
    >> >> > > >> >> Frederick
    >> >> > > >> >>
    >> >> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie
    >> wrote:
    >> >> > > >> >>
    >> >> > > >> >>> The attached patch.diff will make merge.data.frame()
    >> append the
    >> >> > > >> suffixes
    >> >> > > >> >>> to
    >> >> > > >> >>> columns with common names between by.x and names(y).
    >> >> > > >> >>>
    >> >> > > >> >>> Best,
    >> >> > > >> >>>
    >> >> > > >> >>> Scott Ritchie
    >> >> > > >> >>>
    >> >> > > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
    >> >> > s.ritchie73 at gmail.com>
    >> >> > > >> >>> wrote:
    >> >> > > >> >>>
    >> >> > > >> >>> Hi Frederick,
    >> >> > > >> >>>>
    >> >> > > >> >>>> I would expect that any duplicate names in the resulting
    >> >> > data.frame
    >> >> > > >> >>>> would
    >> >> > > >> >>>> have the suffixes appended to them, regardless of
    >> whether or
    >> > not
    >> >> > they
    >> >> > > >> >>>> are
    >> >> > > >> >>>> used as the join key. So in my example I would expect
    >> > "names.x"
    >> >> > and
    >> >> > > >> >>>> "names.y" to indicate their source data.frame.
    >> >> > > >> >>>>
    >> >> > > >> >>>> While careful reading of the documentation reveals this
    >> is not
    >> >> > the
    >> >> > > >> >>>> case, I
    >> >> > > >> >>>> would argue the intent of the suffixes functionality
    >> should
    >> >> > equally
    >> >> > > >> be
    >> >> > > >> >>>> applied to this type of case.
    >> >> > > >> >>>>
    >> >> > > >> >>>> If you agree this would be useful, I'm happy to write a
    >> patch
    >> > for
    >> >> > > >> >>>> merge.data.frame that will add suffixes in this case - I
    >> > intend
    >> >> > to do
    >> >> > > >> >>>> the
    >> >> > > >> >>>> same for merge.data.table in the data.table package
    >> where I
    >> >> > initially
    >> >> > > >> >>>> encountered the edge case.
    >> >> > > >> >>>>
    >> >> > > >> >>>> Best,
    >> >> > > >> >>>>
    >> >> > > >> >>>> Scott
    >> >> > > >> >>>>
    >> >> > > >> >>>> On 17 February 2018 at 03:53, <frederik at ofb.net> wrote:
    >> >> > > >> >>>>
    >> >> > > >> >>>> Hi Scott,
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> It seems like reasonable behavior to me. What result
    >> would
    >> > you
    >> >> > > >> expect?
    >> >> > > >> >>>>> That the second "name" should be called "name.y"?
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> The "merge" documentation says:
    >> >> > > >> >>>>>
    >> >> > > >> >>>>>      If the columns in the data frames not used in
    >> merging
    >> > have
    >> >> > any
    >> >> > > >> >>>>>      common names, these have ‘suffixes’ (‘".x"’ and
    >> ‘".y"’
    >> > by
    >> >> > > >> default)
    >> >> > > >> >>>>>      appended to try to make the names of the result
    >> unique.
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> Since the first "name" column was used in merging,
    >> leaving
    >> > both
    >> >> > > >> >>>>> without a suffix seems consistent with the
    >> documentation...
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> Frederick
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie
    >> > wrote:
    >> >> > > >> >>>>>
    >> >> > > >> >>>>>> Hi,
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> I was unable to find a bug report for this with a
    >> cursory
    >> >> > search,
    >> >> > > >> but
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>> would
    >> >> > > >> >>>>>
    >> >> > > >> >>>>>> like clarification if this is intended or unavoidable
    >> >> > behaviour:
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> ```{r}
    >> >> > > >> >>>>>> # Create example data.frames
    >> >> > > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin",
    >> "Lex"),
    >> >> > > >> >>>>>>                        sex=c("F", "M", "F", "M"),
    >> >> > > >> >>>>>>                        age=c(41, 43, 36, 51))
    >> >> > > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max",
    >> "Qin"),
    >> >> > > >> >>>>>>                         name=c("Oliver", "Sebastian",
    >> >> > "Kai-lee"),
    >> >> > > >> >>>>>>                         sex=c("M", "M", "F"),
    >> >> > > >> >>>>>>                         age=c(5,8,7))
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> # Merge() creates a duplicated "name" column:
    >> >> > > >> >>>>>> merge(parents, children, by.x = "name", by.y =
    >> "parent")
    >> >> > > >> >>>>>> ```
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> Output:
    >> >> > > >> >>>>>> ```
    >> >> > > >> >>>>>>     name sex.x age.x      name sex.y age.y
    >> >> > > >> >>>>>> 1   Max     M    43 Sebastian     M     8
    >> >> > > >> >>>>>> 2   Qin     F    36   Kai-lee     F     7
    >> >> > > >> >>>>>> 3 Sarah     F    41    Oliver     M     5
    >> >> > > >> >>>>>> Warning message:
    >> >> > > >> >>>>>> In merge.data.frame(parents, children, by.x = "name",
    >> by.y =
    >> >> > > >> >>>>>> "parent") :
    >> >> > > >> >>>>>>    column name ‘name’ is duplicated in the result
    >> >> > > >> >>>>>> ```
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> Kind Regards,
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> Scott Ritchie
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>>        [[alternative HTML version deleted]]
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> ______________________________________________
    >> >> > > >> >>>>>> R-devel at r-project.org mailing list
    >> >> > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>
    >> >> > > >> >>>>
    >> >> > > >> >>>>
    >> >> > > >> >> Index: src/library/base/R/merge.R
    >> >> > > >> >>> ==============================
    >> ==============================
    >> >> > =======
    >> >> > > >> >>> --- src/library/base/R/merge.R  (revision 74264)
    >> >> > > >> >>> +++ src/library/base/R/merge.R  (working copy)
    >> >> > > >> >>> @@ -157,6 +157,15 @@
    >> >> > > >> >>>           }
    >> >> > > >> >>>             if(has.common.nms) names(y) <- nm.y
    >> >> > > >> >>> +        ## If by.x %in% names(y) then duplicate column
    >> names
    >> >> > still
    >> >> > > >> >>> arise,
    >> >> > > >> >>> +        ## apply suffixes to these
    >> >> > > >> >>> +        dupe.keyx <- intersect(nm.by, names(y))
    >> >> > > >> >>> +        if(length(dupe.keyx)) {
    >> >> > > >> >>> +          if(nzchar(suffixes[1L]))
    >> >> > > >> >>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
    >> >> > > >> >>> paste(dupe.keyx, suffixes[1L], sep="")
    >> >> > > >> >>> +          if(nzchar(suffixes[2L]))
    >> >> > > >> >>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> >> > > >> >>> paste(dupe.keyx, suffixes[2L], sep="")
    >> >> > > >> >>> +        }
    >> >> > > >> >>>           nm <- c(names(x), names(y))
    >> >> > > >> >>>           if(any(d <- duplicated(nm)))
    >> >> > > >> >>>               if(sum(d) > 1L)
    >> >> > > >> >>>
    >> >> > > >> >>
    >> >> > > >> >> ______________________________________________
    >> >> > > >> >> R-devel at r-project.org mailing list
    >> >> > > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >> > > >> >>
    >> >> > > >> >>
    >> >> > > >> >
    >> >> > > >>
    >> >> > > >>         [[alternative HTML version deleted]]
    >> >> > > >>
    >> >> > > >> ______________________________________________
    >> >> > > >> R-devel at r-project.org mailing list
    >> >> > > >> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >> > > >>
    >> >> > > >
    >> >> > > >
    >> >> > > >
    >> >> > > > --
    >> >> > > > Gabriel Becker, PhD
    >> >> > > > Scientist (Bioinformatics)
    >> >> > > > Genentech Research
    >> >> > > >
    >> >> >
    >> >> > > Index: src/library/base/R/merge.R
    >> >> > > ============================================================
    >> =======
    >> >> > > --- src/library/base/R/merge.R        (revision 74264)
    >> >> > > +++ src/library/base/R/merge.R        (working copy)
    >> >> > > @@ -157,6 +157,15 @@
    >> >> > >          }
    >> >> > >
    >> >> > >          if(has.common.nms) names(y) <- nm.y
    >> >> > > +        ## If by.x %in% names(y) then duplicate column names
    >> still
    >> >> > arise,
    >> >> > > +        ## apply suffixes to these
    >> >> > > +        dupe.keyx <- intersect(nm.by, names(y))
    >> >> > > +        if(length(dupe.keyx)) {
    >> >> > > +          if(nzchar(suffixes[1L]))
    >> >> > > +            names(x)[match(dupe.keyx, names(x), 0L)] <-
    >> >> > paste(dupe.keyx, suffixes[1L], sep="")
    >> >> > > +          if(nzchar(suffixes[2L]))
    >> >> > > +            names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> >> > paste(dupe.keyx, suffixes[2L], sep="")
    >> >> > > +        }
    >> >> > >          nm <- c(names(x), names(y))
    >> >> > >          if(any(d <- duplicated(nm)))
    >> >> > >              if(sum(d) > 1L)
    >> >> >
    >> >> > > ______________________________________________
    >> >> > > R-devel at r-project.org mailing list
    >> >> > > https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >> >
    >> >> >
    >> 
    >> >> Index: src/library/base/R/merge.R
    >> >> ===================================================================
    >> >> --- src/library/base/R/merge.R        (revision 74280)
    >> >> +++ src/library/base/R/merge.R        (working copy)
    >> >> @@ -157,6 +157,14 @@
    >> >> }
    >> >>
    >> >> if(has.common.nms) names(y) <- nm.y
    >> >> +        ## If by.x %in% names(y) then duplicate column names still
    >> arise,
    >> >> +        ## apply suffixes to just y - this keeps backwards
    >> compatibility
    >> >> +        ## when referring to by.x in the resulting data.frame
    >> >> +        dupe.keyx <- intersect(nm.by, names(y))
    >> >> +        if(length(dupe.keyx)) {
    >> >> +          if(nzchar(suffixes[2L]))
    >> >> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> paste(dupe.keyx,
    >> > suffixes[2L], sep="")
    >> >> +        }
    >> >> nm <- c(names(x), names(y))
    >> >> if(any(d <- duplicated(nm)))
    >> >> if(sum(d) > 1L)
    >> 
    >> > [[alternative HTML version deleted]]
    >> 
    >> > ______________________________________________
    >> > R-devel at r-project.org mailing list
    >> > https://stat.ethz.ch/mailman/listinfo/r-devel
    >> 

    > [[alternative HTML version deleted]]