[Rd] setdiff bizarre (was: odd behavior out of setdiff)

Stavros Macrakis macrakis at alum.mit.edu
Sat May 30 14:50:09 CEST 2009

```
It seems to me that, abstractly, a dataframe is just as
straightforwardly a sequence of tuples/observations as a vector is a
sequence of scalars. R's convention is that a 1-vector represents a
scalar, and similarly, a 1-dataframe can represent a tuple (though it
can also be represented as a list). Of course, a dataframe can *also*
be interpreted as a list of vectors.

Just as a sequence of scalars can be interpreted as a set of scalars
by the order- and repetition-ignoring homomorphism, so can a sequence
of tuples. It seems to me natural that set operations should follow
that interpretation.

-s

On 5/30/09, G. Jay Kerns <gkerns at ysu.edu> wrote:
> Dear R-devel,
>
> Please see the recent thread on R-help, "Odd Behavior Out of
> setdiff(...) - addition of duplicate entries is not identified".  I
> posted an answer there and then did some follow-up investigation.
>
> I would like to change my answer.
>
> My current version of setdiff() is acting in a way that I do not
> understand, and in a way that I suspect has changed.  Consider the
> following, derived from Jason's OP:
>
> The base package setdiff(), atomic vectors:
>
> x <- 1:100
> y <- c(x,x)
>
> setdiff(x, y)  # integer(0)
> setdiff(y, x)  # integer(0)
>
> z <- 1:25
>
> setdiff(x,z)   # 26:100
> setdiff(z,x)   # integer(0)
>
>
> Everything is fine.
>
> Now look at base package setdiff(), data frames???
>
> ################################
> A <- data.frame(x = 1:100)
> B <- rbind(A, A)
>
> setdiff(A, B)               # df 1:100?
> setdiff(B, A)               # df 1:100?
>
> C <- data.frame(x = 1:25)
>
> setdiff(A, C)               # df 1:100?
> setdiff(C, A)               # df 1:25?
>
> ############################
>
>
> I have read ?setdiff 37 times now, and I cannot divine any
> interpretation that matches the above output.  From the source, it
> appears that
>
> match(x, y, 0L) == 0L
>
> is evaluating to TRUE, of length equal to the number of columns of x, and then
>
> x[match(x, y, 0L) == 0L]
>
> is returning the entire data frame.
>
> Compare with the output from package "prob", which uses a setdiff that
> operates row-wise:
>
>
> ###########################
> library(prob)
> A <- data.frame(x = 1:100)
> B <- rbind(A, A)
>
> setdiff(A, B)               # integer(0)
> setdiff(B, A)               # integer(0)
>
> C <- data.frame(x = 1:25)
>
> setdiff(A, C)               # 26:100
> setdiff(C, A)               # integer(0)
>
>
>
> IMHO, the entire notion of "set" and "element" is problematic in the
> df case, so I am not advocating the adoption of the prob:::setdiff
> approach;  rather, setdiff is behaving in a way that I cannot believe
> with my own eyes, and I would like to alert those who can speak as to
> why this may be happening.
>
> Thanks to Jason for bringing this up, and to David for catching the
> discrepancy.
>
> Session info is below.  I use the binaries prepared by the Debian
> group so I do not have the latest patched-revision-4440986745343b.
> This must have been related to something which has been fixed since
> April 17, and in that case, please disregard my message.
>
> Yours truly,
> Jay
>
>> sessionInfo()
> R version 2.9.0 (2009-04-17)
> x86_64-pc-linux-gnu
>
> locale:
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] prob_0.9-1
>
> --
>
> ***************************************************
> G. Jay Kerns, Ph.D.
> Associate Professor
> Department of Mathematics & Statistics
> Youngstown State University
> Youngstown, OH 44555-0002 USA
> Office: 1035 Cushwa Hall
> Phone: (330) 941-3310 Office (voice mail)
> -3302 Department
> -3170 FAX
> E-mail: gkerns at ysu.edu
> http://www.cc.ysu.edu/~gjkerns/

```
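
The two readings discussed in this thread (a data frame as a list of columns versus a data frame as a sequence of tuples) can be made concrete with a minimal sketch. The helper `row_setdiff` below is hypothetical: it is not part of base R and is not the implementation used by the prob package. It treats each row as one tuple, ignores order and repetition, and assumes both data frames have the same column names and that the key separator does not occur in the data.

```r
# Hypothetical helper (not base R, not prob's setdiff):
# return the distinct rows of x that do not occur in y.
row_setdiff <- function(x, y) {
  stopifnot(is.data.frame(x), is.data.frame(y),
            identical(names(x), names(y)))
  # Build one string key per row by pasting the column values together.
  # Assumption: the separator "\r" never appears in the data itself.
  row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))
  kx <- row_key(x)
  ky <- row_key(y)
  # Keep the first occurrence of each row of x whose key is absent from y.
  keep <- !duplicated(kx) & !(kx %in% ky)
  x[keep, , drop = FALSE]
}

A <- data.frame(x = 1:100)
B <- rbind(A, A)
C <- data.frame(x = 1:25)

# A data frame is a list of columns, so its length is the column count.
# Element-wise matching therefore sees one "element" here, not 100 rows.
length(A)   # 1
nrow(A)     # 100

nrow(row_setdiff(A, B))   # 0
nrow(row_setdiff(B, A))   # 0
row_setdiff(A, C)$x       # 26:100
nrow(row_setdiff(C, A))   # 0
```

Pasting values into string keys is only one way to compare rows, chosen here for brevity; it compares printed representations, so it may not distinguish values that format identically.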