[R] Dropping "trailing zeroes" in longitudinal data

Mon Apr 26 22:17:09 CEST 2010

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of David Atkins
> Sent: Monday, April 26, 2010 12:23 PM
> To: r-help at r-project.org
> Subject: [R] Dropping "trailing zeroes" in longitudinal data
> 
> 
> Background: Our research group collected data from students via the web
> about their drinking habits (alcohol) over the last 90 days.  As you
> might guess, some students seem to have lost interest and completed some
> information but not all.  Unfortunately, the survey was programmed to
> "pre-populate" the fields with zeroes (to make it easier for students to
> complete).
> 
> Obviously, when we see a stretch of zeroes, we've no idea whether this
> is "true" data or not, but we'd like to at least do some sensitivity
> analyses by dropping "trailing zeroes" (i.e., when there are non-zero
> responses for some duration of the data that then "flat line" into all
> zeroes to the end of the time period).
> 
> I've included a toy dataset below.
> 
> Basically, we have the data in the "long" format, and what I'd like to
> do is subset the data.frame by deleting rows that occur at the end of a
> person's data that are all zeroes.  In a nutshell, select rows from a
> person that are continuously zero, up to the first non-zero, starting at
> the end of their data (which, below, would be time = 10).
> 
> With the toy data, this would be the last 6 rows of ids #10 and #8 (for
> example).  I can begin to think about how I might do this via
> grep/regexp but am a bit stumped about how to translate that to this
> type of data.
> 
> Any thoughts appreciated.
> 
> cheers, Dave
> 
> ### toy dataset
> set.seed(123)
> toy.df <- data.frame(id = factor(rep(1:10, each=10)),
>                      time = rep(1:10, 10),
>                      dv = rnbinom(100, mu = 0.5, size = 100))
> toy.df
> 
> library(lattice)
> 
> xyplot(dv ~ time | id, data = toy.df, type = c("g","l"))

Try using rle (run length encoding) along with either ave()
or lapply().  E.g., define the function

isInTrailingRunOfZeroes <- function (x, group, minRunLength = 1) {
    as.logical(ave(x, group, FUN = function(x) {
        r <- rle(x)
        n <- length(r$values)
        if (n == 0) {
            logical(0)
        } else if (r$values[n] == 0 && r$lengths[n] >= minRunLength) {
            rep(c(FALSE, TRUE), c(sum(r$lengths[-n]), r$lengths[n]))
        } else {
            rep(FALSE, sum(r$lengths))
        }
    }))
}

and use it to drop the trailing runs of 0's with
    xyplot(data=toy.df[!isInTrailingRunOfZeroes(toy.df$dv, toy.df$id),],
           dv~time|id, type=c("g","l"))
or replace them with NA's with
    toy.df.copy <- toy.df
    toy.df.copy[isInTrailingRunOfZeroes(toy.df.copy$dv, toy.df.copy$id),
                "dv"] <- NA
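
The lapply() route mentioned at the top is not spelled out in this
message.  A minimal sketch of one way it could look follows; the helper
name dropTrailingZeroes and the small data frame are made up for
illustration, and it assumes rows are already sorted by time within
each id.

```r
# Hypothetical helper (not from the original post): split the data frame
# by id, trim each piece's trailing run of zeroes, and stack the pieces.
dropTrailingZeroes <- function(df, value = "dv", group = "id") {
    pieces <- lapply(split(df, df[[group]], drop = TRUE), function(d) {
        r <- rle(d[[value]] == 0)   # runs of zero (TRUE) / non-zero (FALSE)
        n <- length(r$values)
        # Drop rows only if the final run is a run of zeroes;
        # isTRUE() leaves a trailing NA untouched instead of failing.
        nDrop <- if (isTRUE(r$values[n])) r$lengths[n] else 0
        d[seq_len(nrow(d) - nDrop), , drop = FALSE]
    })
    out <- do.call(rbind, pieces)
    rownames(out) <- NULL
    out
}

# Made-up example data, not the toy dataset from the question
small <- data.frame(id   = factor(rep(c("a", "b"), each = 5)),
                    time = rep(1:5, 2),
                    dv   = c(0, 2, 0, 0, 0,    # id a: last 3 rows are dropped
                             1, 0, 0, 3, 0))   # id b: last row is dropped
dropTrailingZeroes(small)
```

Zeroes at the start or in the middle of a series are kept; only the
final run is removed.  An id whose values are all zero ends up with no
rows at all.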

The last argument, minRunLength, lets you treat a trailing run as
spurious only if it contains at least that many zeroes.
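
To make the effect of minRunLength concrete, here is a small sketch on
made-up values (the vectors dv and id below are illustrative only, not
taken from the toy dataset).  The function is repeated so that the
snippet runs on its own.

```r
# isInTrailingRunOfZeroes() as defined earlier in this message
isInTrailingRunOfZeroes <- function (x, group, minRunLength = 1) {
    as.logical(ave(x, group, FUN = function(x) {
        r <- rle(x)
        n <- length(r$values)
        if (n == 0) {
            logical(0)
        } else if (r$values[n] == 0 && r$lengths[n] >= minRunLength) {
            rep(c(FALSE, TRUE), c(sum(r$lengths[-n]), r$lengths[n]))
        } else {
            rep(FALSE, sum(r$lengths))
        }
    }))
}

# Made-up data: id "a" ends in three zeroes, id "b" in a single zero
dv <- c(2, 0, 0, 0,   1, 3, 4, 0)
id <- rep(c("a", "b"), each = 4)

# Default minRunLength = 1: both trailing runs are flagged
isInTrailingRunOfZeroes(dv, id)

# minRunLength = 2: the single trailing zero of id "b" is kept
isInTrailingRunOfZeroes(dv, id, minRunLength = 2)
```

Note that the function assumes the values contain no NA's; a trailing NA
would make the comparison inside if() fail.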

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 
> 
> -- 
> Dave Atkins, PhD
> Research Associate Professor
> Department of Psychiatry and Behavioral Science
> University of Washington
> datkins at u.washington.edu
> 
> Center for the Study of Health and Risk Behaviors (CSHRB)		
> 1100 NE 45th Street, Suite 300 	
> Seattle, WA  98105 	
> 206-616-3879 	
> http://depts.washington.edu/cshrb/
> (Mon-Wed)	
> 
> Center for Healthcare Improvement, for Addictions, Mental Illness,
>    Medically Vulnerable Populations (CHAMMP)
> 325 9th Avenue, 2HH-15
> Box 359911
> Seattle, WA 98104
> 206-897-4210
> http://www.chammp.org
> (Thurs)