[Rd] setdiff bizarre (was: odd behavior out of setdiff)

G. Jay Kerns gkerns at ysu.edu
Sat May 30 06:35:22 CEST 2009


Dear R-devel,

Please see the recent thread on R-help, "Odd Behavior Out of
setdiff(...) - addition of duplicate entries is not identified" posted
by Jason Rupert.  I gave an answer, then read David Winsemius' answer,
and then did some follow-up investigation.

I would like to change my answer.

My current version of setdiff() is acting in a way that I do not
understand, and in a way that I suspect has changed.  Consider the
following, derived from Jason's OP:

The base package setdiff(), atomic vectors:

x <- 1:100
y <- c(x,x)

setdiff(x, y)  # integer(0)
setdiff(y, x)  # integer(0)

z <- 1:25

setdiff(x,z)   # 26:100
setdiff(z,x)   # integer(0)


Everything is fine.

Now look at base package setdiff(), data frames???

################################
A <- data.frame(x = 1:100)
B <- rbind(A, A)

setdiff(A, B)               # df 1:100?
setdiff(B, A)               # df 1:100?

C <- data.frame(x = 1:25)

setdiff(A, C)               # df 1:100?
setdiff(C, A)               # df 1:25?

############################
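
For comparison, if the aim is the set difference of the values held in
column x, a minimal sketch is to hand setdiff() the column vectors
rather than the data frames.  This reuses A and B from above; calling
base::setdiff explicitly is only a precaution, assuming another attached
package might mask the function.

```r
A <- data.frame(x = 1:100)
B <- rbind(A, A)
C <- data.frame(x = 1:25)

# Operate on the atomic column vectors, not on the data frames themselves.
base::setdiff(A$x, B$x)   # integer(0)
base::setdiff(B$x, A$x)   # integer(0)
base::setdiff(A$x, C$x)   # 26:100
base::setdiff(C$x, A$x)   # integer(0)
```

This reproduces the atomic-vector results shown earlier, because A$x and
y, x and so on are then ordinary integer vectors.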


I have read ?setdiff 37 times now, and I cannot divine any
interpretation that matches the above output.  From the source, it
appears that

match(x, y, 0L) == 0L

is evaluating to TRUE, with length equal to the number of columns of x, and then

x[match(x, y, 0L) == 0L]

is returning the entire data frame.
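
A small sketch of why this would happen, assuming only that a data
frame is a list with one element per column.  The objects A, B and idx
are illustrative, and the value of idx is printed rather than asserted,
since how match() compares list elements may differ between R versions.

```r
A <- data.frame(x = 1:100)
B <- rbind(A, A)

# A data frame is a list of columns, so its length is the column count.
length(A)

# match() therefore works column by column, giving one result per column.
idx <- match(A, B, 0L) == 0L
length(idx)
print(idx)

# A single-argument logical index on a data frame selects columns,
# so a TRUE here keeps the whole column, that is, every row.
dim(A[TRUE])
```

If idx is TRUE for the only column, x[idx] is simply the full data
frame, which would be consistent with the output shown above.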

Compare with the output from package "prob", which uses a setdiff that
operates row-wise:


###########################
library(prob)
A <- data.frame(x = 1:100)
B <- rbind(A, A)

setdiff(A, B)               # integer(0)
setdiff(B, A)               # integer(0)

C <- data.frame(x = 1:25)

setdiff(A, C)               # 26:100
setdiff(C, A)               # integer(0)

###########################

IMHO, the entire notion of "set" and "element" is problematic in the
df case, so I am not advocating the adoption of the prob:::setdiff
approach;  rather, setdiff is behaving in a way that I cannot believe
with my own eyes, and I would like to alert those who can speak as to
why this may be happening.
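
To make concrete what "row-wise" means here, the following is a minimal
sketch in base R.  It is not the code from package "prob"; the helper
name row_setdiff and the idea of pasting each row into a character key
are assumptions made for illustration, and the keys could collide if the
data themselves contain the separator character.

```r
# Hypothetical helper (not prob's implementation):
# return the distinct rows of x that do not occur in y.
row_setdiff <- function(x, y) {
  stopifnot(is.data.frame(x), is.data.frame(y),
            identical(names(x), names(y)))
  # Build one character key per row by pasting the columns together.
  row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))
  kx <- row_key(x)
  ky <- row_key(y)
  # Keep rows whose key is absent from y, dropping repeated rows of x.
  x[!(kx %in% ky) & !duplicated(kx), , drop = FALSE]
}

A <- data.frame(x = 1:100)
B <- rbind(A, A)
C <- data.frame(x = 1:25)

nrow(row_setdiff(A, B))   # 0
nrow(row_setdiff(B, A))   # 0
row_setdiff(A, C)$x       # 26:100
nrow(row_setdiff(C, A))   # 0
```

Treating each row as the "element" is only one possible convention,
which is part of why the notion is problematic for data frames.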

Thanks to Jason for bringing this up, and to David for catching the discrepancy.

Session info is below.  I use the binaries prepared by the Debian
group so I do not have the latest patched-revision-4440986745343b.
This may well be related to something which has been fixed since
April 17; in that case, please disregard my message.

Yours truly,
Jay

> sessionInfo()
R version 2.9.0 (2009-04-17)
x86_64-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] prob_0.9-1


-- 

***************************************************
G. Jay Kerns, Ph.D.
Associate Professor
Department of Mathematics & Statistics
Youngstown State University
Youngstown, OH 44555-0002 USA
Office: 1035 Cushwa Hall
Phone: (330) 941-3310 Office (voice mail)
-3302 Department
-3170 FAX
E-mail: gkerns at ysu.edu
http://www.cc.ysu.edu/~gjkerns/