[R] Subset dataframe with loop searching for unique values in two columns

arun smartpink111 at yahoo.com
Sun Jun 9 01:47:45 CEST 2013


Hi,
You could try this:

dat2 <- read.table(text='
case pin some_data
"A"  "1" "data"
"A"  "2" "data"
"A"  "1" "data"
"A"  "2" "data"
"B"  "1" "data"
"B"  "2" "data"
',sep="",header=TRUE,stringsAsFactors=FALSE)
dat2[!duplicated(dat2[,1:2]),]
#  case pin some_data
#1    A   1      data
#2    A   2      data
#5    B   1      data
#6    B   2      data
#or

dat2[row.names(unique(dat2[,1:2])),] ##use this when the third column may differ among rows with duplicated `case` and `pin`
#  case pin some_data
#1    A   1      data
#2    A   2      data
#5    B   1      data
#6    B   2      data


#If `some_data` is the same for duplicated rows:
unique(dat2)
#  case pin some_data
#1    A   1      data
#2    A   2      data
#5    B   1      data
#6    B   2      data
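
To see how the approaches above differ once `some_data` varies, here is a small sketch. `dat3` and its values are made up for illustration and are not the poster's data.

```r
# Hypothetical variant of dat2 in which some_data differs between
# rows that share the same case and pin
dat3 <- data.frame(
  case = c("A", "A", "A", "A", "B", "B"),
  pin = c(1, 2, 1, 2, 1, 2),
  some_data = c("x", "y", "z", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Keeps the first row of each case/pin pair, whatever some_data holds
first_rows <- dat3[!duplicated(dat3[, c("case", "pin")]), ]

# Drops only rows that are identical in every column, so row 3
# (A, 1, "z") survives while row 4 (an exact copy of row 2) is removed
distinct_rows <- unique(dat3)
```

So `unique()` on the whole data frame is only a substitute for the `duplicated()` approach when the remaining columns agree within each `case`/`pin` pair.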


A.K.


Hello, 

First off, I'm sure that this is posted somewhere but I've not 
been able to find what I'm looking for. Please forgive the duplication 
and thank you for your help!!!! 

I have a crime dataset of over 500k observations in one file. To
simplify my problem, I have a dataframe that has a "case" ID in one
column, a personal ID number (pin) in another, and associated "data" in
subsequent columns.

Example: 
     case pin some_data 
[1,] "A"  "1" "data"   
[2,] "A"  "2" "data"   
[3,] "A"  "1" "data"   
[4,] "A"  "2" "data"   
[5,] "B"  "1" "data"   
[6,] "B"  "2" "data"   

I would like to subset the data so that only rows with a unique combination of PIN and CASE are left, along with their associated data:

     case pin some_data 
[1,] "A"  "1" "data"   
[2,] "A"  "2" "data"   
  
[5,] "B"  "1" "data"   
[6,] "B"  "2" "data"   

I'm teaching myself how to program in R and I'm thinking that I want a loop to say something like:
- find and keep first row of unique PIN & CASE 
- if PIN is duplicate but CASE is different, keep first row of dupe PIN & new CASE 
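
The two rules above can be written as an explicit loop, sketched here next to a vectorized equivalent. This is only an illustration on the toy example: the `"|"` separator used to build the key is an assumption and must not occur in the real `case` or `pin` values.

```r
# Toy data matching the example in this post
dat2 <- data.frame(
  case = c("A", "A", "A", "A", "B", "B"),
  pin = c("1", "2", "1", "2", "1", "2"),
  some_data = "data",
  stringsAsFactors = FALSE
)

# Explicit loop: keep a row only the first time its case/pin pair appears
seen <- character(0)
keep <- logical(nrow(dat2))
for (i in seq_len(nrow(dat2))) {
  key <- paste(dat2$case[i], dat2$pin[i], sep = "|")
  if (!(key %in% seen)) {
    keep[i] <- TRUE
    seen <- c(seen, key)
  }
}
loop_result <- dat2[keep, ]

# Vectorized equivalent: duplicated() marks every repeat of a case/pin pair
vec_result <- dat2[!duplicated(dat2[, c("case", "pin")]), ]
```

Both give the same rows. On hundreds of thousands of rows the loop gets slow, because `seen` grows and is searched on every iteration, so the `duplicated()` form is the one to prefer.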

Longer Explanation: 
The PIN identifies an arrested offender. I want to check and see if 
there was recidivism, repeat offenses and arrests, for each 
offender/PIN. The way I can do that is by checking whether a PIN has 
multiple CASE numbers. I also want to keep the single arrests in the 
dataset too. I have over 6 million cases for several years. 
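
The recidivism check described here (does a PIN have more than one CASE?) might look like the following sketch. The data frame `arrests`, the extra single-arrest PIN "3", and the column name `repeat_offender` are all invented for illustration.

```r
# Hypothetical data: the example from this post plus one single-arrest PIN
arrests <- data.frame(
  case = c("A", "A", "A", "A", "B", "B", "C"),
  pin = c("1", "2", "1", "2", "1", "2", "3"),
  some_data = "data",
  stringsAsFactors = FALSE
)

# One row per case/pin pair (first occurrence kept)
dedup <- arrests[!duplicated(arrests[, c("case", "pin")]), ]

# Number of distinct cases per PIN
cases_per_pin <- table(dedup$pin)

# PINs that appear in more than one case
repeat_pins <- names(cases_per_pin)[cases_per_pin > 1]

# Flag repeat offenders while keeping single arrests in the data
dedup$repeat_offender <- dedup$pin %in% repeat_pins
```

Because the count is taken after removing duplicated case/pin pairs, a PIN listed several times within one case is not mistaken for a repeat offender.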

I hope this makes sense. I've been banging my head on this one for a while and would really appreciate the help!!


