[R] Removing & generating data by category

David Winsemius dwinsemius at comcast.net
Thu Oct 29 02:54:21 CET 2009


On Oct 28, 2009, at 9:30 PM, Steven Kang wrote:

> Dear R users,
>
>
> Basically, from the following arbitrary data set:
>
> a <- data.frame(id=c(c("A1","A2","A3","A4","A5"),c("A3","A2","A3","A4","A5")),
>                 loc=c("B1","B2","B3","B4","B5"),
>                 clm=c(rep(("General"),6),rep("Life",4)))
>
>> a
>    id   loc  clm
> 1  A1  B1 General
> 2  A2  B2 General
> 3  A3  B3 General
> 4  A4  B4 General
> 5  A5  B5 General
> 6  A3  B1 General
> 7  A2  B2    Life
> 8  A3  B3    Life
> 9  A4  B4    Life
> 10 A5  B5    Life
>
> I would like to remove the records (highlighted above) that have identical
> values in the fields "id" and "loc" but a different value of "clm" (i.e.,
> according to category).

Take a look at this merge operation on separate subsets of the rows of "a".

 > merge( a[a$clm=="Life", ], a[a$clm=="General", ], by=c("id", "loc"), all=TRUE)
   id loc clm.x   clm.y
1 A1  B1  <NA> General
2 A2  B2  Life General
3 A3  B1  <NA> General
4 A3  B3  Life General
5 A4  B4  Life General
6 A5  B5  Life General

Assignment of that object and selection with is.na should complete the  
process.

 > a2m <- merge( a[a$clm=="Life", ], a[a$clm=="General", ], by=c("id", "loc"), all=TRUE)

 > a2m[ is.na(a2m$clm.x) | is.na(a2m$clm.y), ]
   id loc clm.x   clm.y
1 A1  B1  <NA> General
3 A3  B1  <NA> General
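
If the result should carry the original three columns, as in the desired
output, the two clm columns can be collapsed back into one. What follows is
only a minimal sketch, assuming clm is never NA in the original data; the
names "unmatched" and "res" are illustrative and not part of the original
question.

```r
# Rebuild the example data (loc is written out in full instead of recycled).
a <- data.frame(id  = c("A1","A2","A3","A4","A5","A3","A2","A3","A4","A5"),
                loc = rep(c("B1","B2","B3","B4","B5"), 2),
                clm = c(rep("General", 6), rep("Life", 4)))

# Full outer join of the two categories on id and loc.
a2m <- merge(a[a$clm == "Life", ], a[a$clm == "General", ],
             by = c("id", "loc"), all = TRUE)

# Keep only the id/loc pairs that appear in one category but not the other.
unmatched <- a2m[is.na(a2m$clm.x) | is.na(a2m$clm.y), ]

# Collapse clm.x / clm.y into a single clm column (assumes clm itself
# is never NA in the original data).
unmatched$clm <- ifelse(is.na(unmatched$clm.x),
                        as.character(unmatched$clm.y),
                        as.character(unmatched$clm.x))

res <- unmatched[, c("id", "loc", "clm")]
res
```

The row names of res come from the merged object, not from "a", so they will
not match the row numbers of the original data frame.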

Alternate methods might include paste-ing id to loc and removing  
duplicates.
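
A rough sketch of that paste-based idea follows. It assumes that the
separator "|" never occurs in id or loc, and that an id/loc pair never
repeats within the same clm category (true for the example data, untested
for the real data set). The names "key" and "dup" are illustrative.

```r
# Rebuild the example data (loc is written out in full instead of recycled).
a <- data.frame(id  = c("A1","A2","A3","A4","A5","A3","A2","A3","A4","A5"),
                loc = rep(c("B1","B2","B3","B4","B5"), 2),
                clm = c(rep("General", 6), rep("Life", 4)))

# Build one key per row from id and loc.
key <- paste(a$id, a$loc, sep = "|")

# Flag every row whose key occurs more than once: duplicated() alone marks
# only the second and later occurrences, so scan from both ends.
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Rows whose id/loc pair occurs exactly once.
a[!dup, ]
```

This is fully vectorized, so it avoids the row-by-row loop, and it keeps the
original row names. If an id/loc pair could repeat within one category, those
rows would be dropped as well, and the merge approach is the safer choice.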


> i.e.
>> categ <- table(a$id,a$clm)
>> categ
>
>     General Life
>  A1       1    0
>  A2       1    1
>  A3       2    1
>  A4       1    1
>  A5       1    1
>
> The desired output is
>
>    id   loc  clm
> 1  A1  B1 General
> 6  A3  B1 General
>
> Because the data set I am working on is quite big (~ 800,000 x 20),
> with the majority of the field values being long strings, looping
> turned out to be very inefficient for comparing individual rows.
>
> Are there any more efficient alternative methods for solving this
> problem?
> Steven
-- 

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



