[R] Within ID variable delete all rows after reaching a specific value

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Sat Apr 26 09:32:45 CEST 2014


Jennifer:

a) Don't post in HTML... read the Posting Guide.

b) Don't make data frames by first making matrices... you rarely create 
what you think you are creating. A matrix holds only one type, so your 
numbers and dates are all coerced to character, and your code ends up 
creating a bunch of factor columns... use the str() function to verify 
that your data are sensible before analyzing them.
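
For illustration only, a minimal sketch of the difference on three rows 
(the names `bad` and `good` and the toy values are made up here, they are 
not from your data). Whether the matrix route gives you character or 
factor columns depends on the stringsAsFactors default of your R version, 
but either way they are not numeric:

```r
# Hypothetical toy data; 'bad' and 'good' are illustrative names only.
# c() coerces numbers and strings to one character vector, so matrix()
# hands data.frame() three columns of strings.
bad <- data.frame( matrix( c( c( 0, 1, 0 )
                            , 1:3
                            , c( "01.01.1990", "01.02.1990", "01.03.1990" ) )
                         , ncol=3 ) )
str( bad )    # X1 and X2 are not numeric

# Building the columns directly keeps each one in its own type.
good <- data.frame( X1 = c( 0, 1, 0 )
                  , X2 = 1:3
                  , X3 = as.Date( c( "01.01.1990", "01.02.1990", "01.03.1990" )
                                , format="%d.%m.%Y" ) )
str( good )   # numeric, integer, and Date columns
```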

c) The ave and cumsum functions are useful here:

tmp <- data.frame( X1 = rbinom( 1000, 1, .03 )
                  , X2 = array( 1:127, c(1000,1) )
                  , X3 = array( format( seq( ISOdate(1990,1,1)
                                           , by='month'
                                           , length=56 )
                                      , format='%d.%m.%Y')
                              , c( 1000, 1 ) ) )
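# Convert X3 to Date first so that order() sorts chronologically
# rather than alphabetically by the dd.mm.YYYY strings.
tmp$X3 <- as.Date( tmp$X3, format='%d.%m.%Y' )
tmp <- tmp[ with( tmp, order( X2, X3 ) ), ]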
tmp2 <- subset( tmp
               , 1 >= ave( X1
                         , X2
                         , FUN=function( x ) {
                             cumsum( cumsum( x ) )
                           } ) )

which, within each X2 group, generates a vector that is zero until the 
first nonzero X1, one at that row, and larger on every later row, and 
then only keeps the rows for which those values are zero or one.
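
To see why the nested cumsum works, here is a small sketch on invented 
values (the objects `x`, `toy`, `id`, `status` and `keep` are made up for 
this illustration and are not part of the code above):

```r
# One ID, six visits, with erroneous rows after the first 1.
x <- c( 0, 0, 1, 0, 1, 0 )
cumsum( x )             # 0 0 1 1 2 2 : number of 1s seen so far
cumsum( cumsum( x ) )   # 0 0 1 2 4 6 : exceeds 1 on every row after the first 1

# The same idea applied per group with ave(), on two hypothetical IDs.
toy <- data.frame( id     = c( 1, 1, 1, 1, 2, 2, 2 )
                 , status = c( 0, 1, 0, 1, 0, 0, 0 ) )
keep <- ave( toy$status
           , toy$id
           , FUN=function( x ) cumsum( cumsum( x ) ) ) <= 1
toy[ keep, ]            # rows 1 and 2 for id 1, all three rows for id 2
```

An ID that never reaches 1 has all zeros in the nested cumsum, so all of 
its rows are kept.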

On Sat, 26 Apr 2014, Jim Lemon wrote:

> On 04/26/2014 12:42 PM, Jennifer Sabatier wrote:
>> So, I know that's a confusing Subject header.
>> 
>> Here's similar data:
>> 
>> 
>> tmp<- data.frame(matrix(
>>                          c(rbinom(1000, 1, .03),
>>                            array(1:127, c(1000,1)),
>>                            array(format(seq(ISOdate(1990,1,1), by='month',
>> length=56), format='%d.%m.%Y'), c(1000,1))),
>>                          ncol=3))
>> tmp<- tmp[with(tmp, order(X2, X3)), ]
>> table(tmp$X1)
>> 
>> 
>> X1 is the variable of interest - disease status.  It's a survival-type of
>> variable, where you are 0 until you become 1.
>> X2 is the person ID variable.
>> X3 is the clinic date (here it's monthly, just for example...but in my real
>> data it's a bit more complicated - definitely not equally spaced nor the
>> same number of visits to the clinic per ID.).
>> 
>> Some people stay X1 = 0 for all clinic visits.  Only a small proportion
>> become X1=1.
>> 
>> However, the data has errors I need to clean off.  Once someone becomes
>> X1=1 they should have no more rows in the dataset.  These are data entry
>> errors.
>> 
>> In my data I have people who continue to have rows in the data.  Sometimes
>> the rows show X1=0 and sometimes X1=1.  Sometimes there's just one more row
>> and sometimes there are many more rows.
>> 
>> How can I go through, find the first X1 = 1, and then delete any rows after
>> that, for each value of X2?
>> 
>> Thanks!
>> 
>> Jen
>> 
> Hi Jen,
> This might do what you want:
>
> tmp$X3<-as.Date(tmp$X3,"%d.%m.%Y")
> tmp<-tmp[order(tmp$X2,tmp$X3),]
> first<-TRUE
> for(patno in unique(tmp$X2)) {
> cat(patno,"\n")
> tmpbit<-tmp[tmp$X2 == patno,]
> firstone<-which(tmpbit$X1 == 1)[1]
> cat(firstone,"\n")
> if(is.na(firstone)) firstone<-dim(tmpbit)[1]
> newtmpbit<-tmpbit[1:firstone,]
> if(first) {
>  newtmp<-newtmpbit
>  first<-FALSE
> }
> else newtmp<-rbind(newtmp,newtmpbit)
> }
>
> Jim
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k



