[R] finding both rows that are duplicated in a data frame

arun smartpink111 at yahoo.com
Sat Sep 7 18:19:15 CEST 2013

Hi,

Suppose you have a situation like this, where both duplicates are UNKNOWN and you want to remove them:


example1 <- rbind(example, data.frame(id1=c(11,12,12), id2=c(93,95,95), GENDER=rep("G-UNK",3), ETH=rep("E-UNK",3)))
spl <- as.character(interaction(example1$id1, example1$id2))
res1 <- do.call(rbind, lapply(split(example1, spl), function(x) {
  ## rows of this id1/id2 pair with no UNK in GENDER (col 3) or ETH (col 4)
  indx <- !(grepl("UNK", x[,3]) | grepl("UNK", x[,4]))
  if (sum(indx) == 0) {
    ## no fully known row: take the non-UNK value of each column separately
    x[,3] <- x[,3][-grep("UNK", x[,3])]
    x[,4] <- x[,4][-grep("UNK", x[,4])]
    unique(x)
  } else unique(x[indx,])
}))
res1 <- res1[!(is.na(res1[,3]) | is.na(res1[,4])),]  ## remove the rows with NA

res2 <- res1[order(res1$id1),]
row.names(res2) <- 1:nrow(res2)
res2
#   id1 id2 GENDER  ETH
#1    1  22    G-M E-VT
#2    2  34    G-M E-AF
#3    3  15    G-M E-AF
#4    4  76    G-F E-VT
#5    5  45    G-F E-VT
#6    6  84    G-F E-AF
#7    7  37    G-F E-AF
#8    8  52    G-F E-AF
#9    9  66    G-F E-AF
#10  10  91    G-F E-VT
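
The one-liner above is dense, so here is a minimal sketch of the same idea written out step by step. It is not the code from this thread: the data frame `toy` and the helpers `is_unk` and `resolve_group` are made-up names for illustration, and it assumes each id1/id2 pair has at most one distinct known value per column. Pairs where a column is unknown in every row are dropped.

```r
# Hypothetical toy data: pair (1,22) has one fully known row, pair (6,84) has
# its known values split across two rows, pair (12,95) is unknown everywhere.
toy <- data.frame(
  id1    = c(1, 1, 6, 6, 12, 12),
  id2    = c(22, 22, 84, 84, 95, 95),
  GENDER = c("G-UNK", "G-M", "G-UNK", "G-F", "G-UNK", "G-UNK"),
  ETH    = c("E-VT", "E-VT", "E-AF", "E-UNK", "E-UNK", "E-UNK"),
  stringsAsFactors = FALSE
)

is_unk <- function(v) grepl("UNK", v, fixed = TRUE)

# Collapse one id1/id2 group to a single row, or to zero rows when either
# column has no known value at all.
resolve_group <- function(x) {
  g <- unique(x$GENDER[!is_unk(x$GENDER)])
  e <- unique(x$ETH[!is_unk(x$ETH)])
  if (length(g) == 0 || length(e) == 0) return(x[0, ])
  # Assumption: at most one distinct known value per column within a group.
  data.frame(id1 = x$id1[1], id2 = x$id2[1], GENDER = g[1], ETH = e[1],
             stringsAsFactors = FALSE)
}

groups  <- split(toy, interaction(toy$id1, toy$id2, drop = TRUE))
res_toy <- do.call(rbind, lapply(groups, resolve_group))
res_toy <- res_toy[order(res_toy$id1), ]
row.names(res_toy) <- NULL
res_toy
```

On this toy input the pairs (1,22) and (6,84) each collapse to one row and the all-unknown pair (12,95) disappears.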

A.K.


----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: Robert Lynch <robert.b.lynch at gmail.com>
Cc: R help <r-help at r-project.org>
Sent: Saturday, September 7, 2013 11:30 AM
Subject: Re: [R] finding both rows that are duplicated in a data frame

Hi,
Maybe this is what you are looking for.


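spl <- as.character(interaction(example$id1, example$id2))
res <- do.call(rbind, lapply(split(example, spl), function(x) {
  ## rows of this id1/id2 pair with no UNK in GENDER (col 3) or ETH (col 4)
  indx <- !(grepl("UNK", x[,3]) | grepl("UNK", x[,4]))
  if (sum(indx) == 0) {
    ## no fully known row: take the non-UNK value of each column separately
    x[,3] <- x[,3][-grep("UNK", x[,3])]
    x[,4] <- x[,4][-grep("UNK", x[,4])]
    unique(x)
  } else unique(x[indx,])
}))

res1 <- res[order(res$id1),]
row.names(res1) <- 1:nrow(res1)
res1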
#   id1 id2 GENDER  ETH
#1    1  22    G-M E-VT
#2    2  34    G-M E-AF
#3    3  15    G-M E-AF
#4    4  76    G-F E-VT
#5    5  45    G-F E-VT
#6    6  84    G-F E-AF
#7    7  37    G-F E-AF
#8    8  52    G-F E-AF
#9    9  66    G-F E-AF
#10  10  91    G-F E-VT
A.K.



----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: Robert Lynch <robert.b.lynch at gmail.com>
Cc: R help <r-help at r-project.org>
Sent: Saturday, September 7, 2013 10:52 AM
Subject: Re: [R] finding both rows that are duplicated in a data frame

Hi,
example<- data.frame(id1,id2,GENDER,ETH,stringsAsFactors=FALSE)

res <- unique(example[!(grepl("UNK", example$GENDER) | grepl("UNK", example$ETH)),])
res
#   id1 id2 GENDER  ETH
#1    1  22    G-M E-VT
#3    2  34    G-M E-AF
#5    3  15    G-M E-AF
#7    4  76    G-F E-VT
#8    5  45    G-F E-VT
#12   7  37    G-F E-AF
#13   8  52    G-F E-AF
#14   9  66    G-F E-AF
#16  10  91    G-F E-VT


The condition for id1 6 is a bit unclear. If I include both of its rows, the result will have 11 rows; now it has 9.

10   6  84  G-UNK  E-AF
11   6  84    G-F E-UNK
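
To see which id pairs a row-wise filter like the one above drops entirely, one option is to compare the set of all id pairs with the set of pairs that still have a fully known row. This is only a sketch on made-up data; `toy`, `known`, `key` and `lost` are hypothetical names, not objects from this thread.

```r
# Hypothetical toy data: pair (5,45) has one fully known row,
# pair (6,84) has an UNK in every row (as in rows 10 and 11 above).
toy <- data.frame(
  id1    = c(5, 5, 6, 6),
  id2    = c(45, 45, 84, 84),
  GENDER = c("G-F", "G-UNK", "G-UNK", "G-F"),
  ETH    = c("E-VT", "E-VT", "E-AF", "E-UNK"),
  stringsAsFactors = FALSE
)

# TRUE for rows with no UNK in either column
known <- !(grepl("UNK", toy$GENDER) | grepl("UNK", toy$ETH))

# One key per id1/id2 pair; pairs with no known row are lost by the filter
key  <- paste(toy$id1, toy$id2)
lost <- setdiff(unique(key), unique(key[known]))

lost                    # id pairs with no fully known row
toy[key %in% lost, ]    # the rows that need a per-column merge instead
```

Pairs listed in `lost` are the ones where the known GENDER and the known ETH sit on different rows, so they need to be combined column by column rather than filtered row by row.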


A.K.



----- Original Message -----
From: Robert Lynch <robert.b.lynch at gmail.com>
To: R help <r-help at r-project.org>
Cc: 
Sent: Saturday, September 7, 2013 3:02 AM
Subject: [R] finding both rows that are duplicated in a data frame

I have a data frame that looks like

id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10)
id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91)
GENDER<-sample(c("G-UNK","G-M","G-F"),16, replace = TRUE)
ETH <-sample(c("E-AF","E-UNK","E-VT"),16, replace = TRUE)
example<-cbind(id1,id2,GENDER,ETH)

where there are two IDs and some duplicate entries for IDs that have
different GENDER or ETH(nicity).
I would like to get a data frame without the duplicates, where the
rows that are kept have whichever GENDER is not G-UNK (unknown) and
whichever ETH is not E-UNK.

The resulting data frame should have 10 rows with no *-UNK in either of the
last two columns (unless both entries were UNK).
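
One possible reading of this requirement, sketched on made-up data: for each id1/id2 pair and each column, keep the first value that is not UNK, and fall back to the UNK label only when every entry in the group is UNK. The names `toy` and `pick_known` are hypothetical, and `aggregate()` is just one of several base R ways to do the grouping.

```r
# Hypothetical toy data: (1,22) has its known values on different rows,
# (4,76) is a single row, (9,66) has unknown GENDER in both rows.
toy <- data.frame(
  id1    = c(1, 1, 4, 9, 9),
  id2    = c(22, 22, 76, 66, 66),
  GENDER = c("G-UNK", "G-M", "G-F", "G-UNK", "G-UNK"),
  ETH    = c("E-VT", "E-UNK", "E-VT", "E-AF", "E-AF"),
  stringsAsFactors = FALSE
)

# First non-UNK value in a group, or the UNK label if nothing else exists
pick_known <- function(v) {
  known <- v[!grepl("UNK", v, fixed = TRUE)]
  if (length(known) > 0) known[1] else v[1]
}

# One row per id1/id2 pair, each column resolved independently
res_toy <- aggregate(toy[c("GENDER", "ETH")],
                     by = toy[c("id1", "id2")],
                     FUN = pick_known)
res_toy <- res_toy[order(res_toy$id1), ]
row.names(res_toy) <- NULL
res_toy
```

Here the pair (9,66) keeps G-UNK because both of its GENDER entries are unknown, which matches the "unless both entries were UNK" exception.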

Yes, the example data may contain some impossible combinations, but it does
capture the important aspects:
1) G-UNK is alphabetically last of G-F, G-M and G-UNK.
2) E-UNK is in the middle alphabetically.
3) Sometimes the first entry is the one with unknown gender, sometimes it is
the second (likely to happen with a random sample).
4) Sometimes both entries for one variable, GENDER or ETH, are unknown.
5) There appear to be only two rows per duplicated ID (not 100% sure).
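
Two practical notes on the example data, shown as a sketch: `sample()` gives different values on every run unless a seed is set, and `cbind()` on numeric and character vectors returns a character matrix rather than a data frame. The seed value 42 and the name `as_matrix` are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, only so that repeated runs give the same data

id1 <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 9, 9, 10)
id2 <- c(22, 22, 34, 34, 15, 15, 76, 45, 45, 84, 84, 37, 52, 66, 66, 91)
GENDER <- sample(c("G-UNK", "G-M", "G-F"), 16, replace = TRUE)
ETH    <- sample(c("E-AF", "E-UNK", "E-VT"), 16, replace = TRUE)

# cbind() coerces everything to character and returns a matrix
as_matrix <- cbind(id1, id2, GENDER, ETH)
is.character(as_matrix)

# data.frame() keeps the ids numeric and the labels as character
example <- data.frame(id1, id2, GENDER, ETH, stringsAsFactors = FALSE)
str(example)
```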

Thanks!
Robert
