[R] Matching on multiple columns

Charles C. Berry cberry at tajo.ucsd.edu
Thu Jan 11 23:26:05 CET 2007


On Thu, 11 Jan 2007, Farrel Buchinsky wrote:

> Am I correct in believing that one cannot match on multiple columns?
> One can indeed subset on multiple criteria from different variables
> (or columns) but not from unique combinations thereof.
> I need to exclude about 10000 rows from 108000 rows of data based on
> several unique combinations of identifiers in two columns. Only
> merge() seems to be able to do that. Merge would allow me to
> positively select but it would not allow me to deselect (or exclude).
> Look at how I got around the problem.
> It is inelegant. Have a missed a more direct function?
>


I guess that depends on what you regard as 'more direct'.

  	newdata[ !is.element( interaction(newdata), interaction(exclude) ) , ]


>
> x<-seq(from=1,to=10)
> y <-rep(1:2,5)
> data<-data.frame(cbind(x,y))
> x<-seq(from=1,to=10)
> y <-rep(1:5,2)
> data1<-data.frame(cbind(x,y))
> newdata<-rbind(data,data1)
> newdata[10,2] <- 3
> exclude<-newdata[18:20,]
> #This is a simulation of real life problem
> #We now have two dataframes. Newdata is the data set from an
> experiment. In real #life it has an additional 9 columns of data
> #exclude is a dataframe that was manually created after discovering
> some quality #control issues
> #Any row in newdata that matches any row in exclude must be discarded
> match(newdata$x,exclude$x)# useless because it is only on one column
> match(newdata$y,exclude$y)# useless because it is only on one column
> newdata$x %in%  exclude$x  # useless because it is only on one column
> newdata$y %in%  exclude$y # useless because it is only on one column
> newdata$x %in%  exclude$x &  newdata$y %in%  exclude$y  # useless,
> #eventhough it is using both columns, because it is not
> #using them in a synchronous manner. Row 10 in new data should not
> have been #marked "TRUE"
> #It was only labeled such because the 10 in the x column is indeed in
> the exclude x #column and the 3 in the y column is indeed in the
> exclude y column but not #together
> which(newdata$x %in%  exclude$x &  newdata$y %in%  exclude$y)#also
> gets it #wrong
> match(newdata,exclude)# intuitively this could have worked but alas
> match can #only handle vectors and not dataframes. It cannot match on
> multiple columns
> #I have to stoop to the inelegant maneuver of creating a combined
> variable of the #two columns, albeit only temporarily
> paste(newdata$x,newdata$y,sep=":") %in% paste(exclude$x,exclude$y,sep=":")
> #or one could do this
> match(paste(newdata$x,newdata$y,sep=":"),paste(exclude$x,exclude$y,sep=":"))
> #or
> {which(paste(newdata$x,newdata$y,sep=":") %in% paste
> (exclude$x,exclude$y,sep=":"))}
> #Or if you want to use the result in an index term or a selection
> argument in a subset command
> {paste(newdata$x,newdata$y,sep=":") %in%
> paste(exclude$x,exclude$y,sep=":")==FALSE}
> #As in
> {newdata[paste(newdata$x,newdata$y,sep=":") %in%
> paste(exclude$x,exclude$y,sep=":")==FALSE,]}
> #or
> {subset(newdata,paste(newdata$x,newdata$y,sep=":") %in%
> paste(exclude$x,exclude$y,sep=":")==FALSE)}
>
>
> -- 
> Farrel Buchinsky
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Charles C. Berry                        (858) 534-2098
                                          Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	         UC San Diego
http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0901



More information about the R-help mailing list