[Rd] setdiff bizarre (was: odd behavior out of setdiff)

Jason Rupert jasonkrupert at yahoo.com
Sat May 30 21:30:30 CEST 2009


Jay, 


I really appreciate all your help.

I posted an R file and input CSV files to Nabble that more accurately demonstrate what I am seeing, and the output I want to get when I difference two data frames:
http://n2.nabble.com/Support-SetDiff-Discussion-Items...-td2999739.html


It may be that "setdiff", as implemented in base R and in "prob", was never intended to provide the type of result I desire.  If that is the case, then I will need to ask the "Ninjas" for help to produce the outcome I seek.

That is, when I difference the data in RSetDiffEntry.csv and RSetDuplicatesRemoved.csv, I want to get the result shown in RDesired.csv.

Note that it would not be enough to simply remove duplicate "CostPerSquareFoot" values, since that variable is tied to "EntryDate" and "HouseNumber".
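
To make the row-wise requirement concrete, here is a minimal sketch in base R. It assumes the goal is to recover the rows that occur more often in the first data frame than in the second (which may or may not match RDesired.csv). The data values and the helper names row_key() and occurrence() are invented for illustration; only the column names come from the post.

```r
# Hypothetical stand-ins for the two CSV files (values are made up).
entry <- data.frame(
  EntryDate         = c("2009-05-01", "2009-05-02", "2009-05-02", "2009-05-03"),
  HouseNumber       = c(10L, 11L, 11L, 12L),
  CostPerSquareFoot = c(100, 120, 120, 100),
  stringsAsFactors  = FALSE
)
deduped <- entry[!duplicated(entry), ]

# One key per row, so whole rows are compared rather than single columns.
row_key <- function(df) do.call(paste, c(df, sep = "\r"))

# The k-th appearance of a given key gets index k.
occurrence <- function(keys) ave(seq_along(keys), keys, FUN = seq_along)

keys_entry   <- row_key(entry)
keys_deduped <- row_key(deduped)
id_entry     <- paste(keys_entry,   occurrence(keys_entry),   sep = "#")
id_deduped   <- paste(keys_deduped, occurrence(keys_deduped), sep = "#")

# Rows of 'entry' that have no counterpart in 'deduped'.
extra_rows <- entry[!(id_entry %in% id_deduped), , drop = FALSE]
print(extra_rows)
```

In this made-up data, CostPerSquareFoot 100 appears twice but with different EntryDate and HouseNumber values, so those rows are not reported; only the fully repeated row is.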

Any further help and insights are much appreciated. 

Thanks again, 
Jason 





--- On Fri, 5/29/09, G. Jay Kerns <gkerns at ysu.edu> wrote:

> From: G. Jay Kerns <gkerns at ysu.edu>
> Subject: setdiff bizarre (was: odd behavior out of setdiff)
> To: r-devel at r-project.org
> Cc: dwinsemius at comcast.net, jasonkrupert at yahoo.com
> Date: Friday, May 29, 2009, 11:35 PM
> Dear R-devel,
> 
> Please see the recent thread on R-help, "Odd Behavior Out of
> setdiff(...) - addition of duplicate entries is not identified" posted
> by Jason Rupert.  I gave an answer, then read David Winsemius' answer,
> and then did some follow-up investigation.
> 
> I would like to change my answer.
> 
> My current version of setdiff() is acting in a way that I do not
> understand, and a way that I suspect has changed.  Consider the
> following, derived from Jason's OP:
> 
> The base package setdiff(), atomic vectors:
> 
> x <- 1:100
> y <- c(x,x)
> 
> setdiff(x, y)  # integer(0)
> setdiff(y, x)  # integer(0)
> 
> z <- 1:25
> 
> setdiff(x,z)   # 26:100
> setdiff(z,x)   # integer(0)
> 
> 
> Everything is fine.
> 
> Now look at base package setdiff(), data frames???
> 
> ################################
> A <- data.frame(x = 1:100)
> B <- rbind(A, A)
> 
> setdiff(A, B)   # df 1:100?
> setdiff(B, A)   # df 1:100?
> 
> C <- data.frame(x = 1:25)
> 
> setdiff(A, C)   # df 1:100?
> setdiff(C, A)   # df 1:25?
> 
> ############################
> 
> 
> I have read ?setdiff 37 times now, and I cannot divine any
> interpretation that matches the above output.  From the source, it
> appears that
> 
> match(x, y, 0L) == 0L
> 
> is evaluating to TRUE, of length equal to the columns of x, and then
> 
> x[match(x, y, 0L) == 0L]
> 
> is returning the entire data frame.
> 
> Compare with the output from package "prob", which uses a setdiff that
> operates row-wise:
> 
> 
> ###########################
> library(prob)
> A <- data.frame(x = 1:100)
> B <- rbind(A, A)
> 
> setdiff(A, B)   # integer(0)
> setdiff(B, A)   # integer(0)
> 
> C <- data.frame(x = 1:25)
> 
> setdiff(A, C)   # 26:100
> setdiff(C, A)   # integer(0)
> 
> 
> 
> IMHO, the entire notion of "set" and "element" is problematic in the
> df case, so I am not advocating the adoption of the prob:::setdiff
> approach;  rather, setdiff is behaving in a way that I cannot believe
> with my own eyes, and I would like to alert those who can speak as to
> why this may be happening.
> 
> Thanks to Jason for bringing this up, and to David for
> catching the discrepancy.
> 
> Session info is below.  I use the binaries prepared by the Debian
> group so I do not have the latest patched-revision-4440986745343b.
> This must have been related to something which has been fixed since
> April 17, and in that case, please disregard my message.
> 
> Yours truly,
> Jay
> 
> 
> 
> 
> 
> 
> > sessionInfo()
> R version 2.9.0 (2009-04-17)
> x86_64-pc-linux-gnu
> 
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] prob_0.9-1
> 
> 
> -- 
> 
> ***************************************************
> G. Jay Kerns, Ph.D.
> Associate Professor
> Department of Mathematics & Statistics
> Youngstown State University
> Youngstown, OH 44555-0002 USA
> Office: 1035 Cushwa Hall
> Phone: (330) 941-3310 Office (voice mail)
> -3302 Department
> -3170 FAX
> E-mail: gkerns at ysu.edu
> http://www.cc.ysu.edu/~gjkerns/
> 
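
On the base-versus-prob discrepancy in the quoted message: a data frame is a list of columns, so length() reports the number of columns rather than rows, and code written for vectors or lists treats each column as a single element. A minimal sketch, reusing A and C from the quoted example. The row-wise filter with %in% is only an illustrative alternative for the one-column case, not a description of what either setdiff() does internally.

```r
A <- data.frame(x = 1:100)
C <- data.frame(x = 1:25)

# A data frame is a list whose elements are its columns.
print(is.list(A))
print(length(A))   # number of columns, not number of rows
print(nrow(A))

# Row-wise difference for the single column x: keep rows of A whose
# value does not appear in C.
A_not_C <- A[!(A$x %in% C$x), , drop = FALSE]
print(range(A_not_C$x))
```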





