[R] formatting a 6 million row data set; creating a censoring variable

Wed Aug 31 21:32:55 CEST 2011

I'll assume that all of an individual's data rows
are contiguous and that an individual always passes through
the groups in order (or, least, the individual
never leaves a group and then reenters it), so we
can find everything we need to know by comparing each
row with the previous row.

You can use rle() to quickly make the time
column: 
  > rle(paste(d$mygroup, d$id))$lengths
  [1] 1 3 3 6 3 3 1

For the censor column it is probably easiest to consider
what rle() must do internally and use a modification of that.
E.g.,
  isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
  isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
  outputRows <- isLastInRun(d$mygroup) | isLastInRun(d$id)
  output <- d[outputRows, ]
  output$mytime <- diff(c(0, which(outputRows)))
  output$censor <- as.integer(isLastInRun(e$id))
which gives you
  > output
     gender mygroup id mytimes censor
  1       F       A  1       1      1
  4       F       B  2       3      0
  7       F       C  2       3      0
  13      F       D  2       6      1
  16      M       A  3       3      0
  19      M       B  3       3      1
  20      M       A  4       1      1
You showed a rearrangment of the columns
  > output[, c("id", "mygroup", "mytime", "censor")]
     id mygroup mytime censor
  1   1       A      1      1
  4   2       B      3      0
  7   2       C      3      0
  13  2       D      6      1
  16  3       A      3      0
  19  3       B      3      1
  20  4       A      1      1
This ought to be quicker than plyr, but data.table
may do similar run-oriented operations.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Juliet Hannah
> Sent: Wednesday, August 31, 2011 10:51 AM
> To: r-help at r-project.org
> Subject: [R] formatting a 6 million row data set; creating a censoring variable
> 
> List,
> 
> Consider the following data.
> 
>    gender mygroup id
> 1       F       A  1
> 2       F       B  2
> 3       F       B  2
> 4       F       B  2
> 5       F       C  2
> 6       F       C  2
> 7       F       C  2
> 8       F       D  2
> 9       F       D  2
> 10      F       D  2
> 11      F       D  2
> 12      F       D  2
> 13      F       D  2
> 14      M       A  3
> 15      M       A  3
> 16      M       A  3
> 17      M       B  3
> 18      M       B  3
> 19      M       B  3
> 20      M       A  4
> 
> Here is the reshaping I am seeking (explanation below).
> 
>      id mygroup mytime censor
> [1,]  1       A      1      1
> [2,]  2       B      3      0
> [3,]  2       C      3      0
> [4,]  2       D      6      1
> [5,]  3       A      3      0
> [6,]  3       B      3      1
> [7,]  4       A      1      1
> 
> I need to create 2 variables. The first one is a time variable.
> Observe that for id=2, the variable mygroup=B was observed 3 times. In
> the solution we see in row 2 that id=2 has a mytime variable of 3.
> 
> Next, I need to create a censoring variable.
> 
> Notice id=2 goes through has values of B, C, D for mygroup. This means
> the change from B to C and C to D is observed.  There is no change
> from D. I need to indicate this with a 'censoring' variable. So B and
> C would have values 0, and D would have a value of 1. As another
> example, id=1 never changes, so I assign it  censor= 1. Overall, if a
> change is observed, 0 should be assigned, and if a change is not
> observed 1 should be assigned.
> 
> One potential challenge is that the original data set has over 5
> million rows. I have ideas, but I'm still getting used the the
> data.table and plyr syntax.  I also seek a base R solution. I'll post
> the timings on the real data set shortly.
> 
> Thanks for your help.
> 
> > sessionInfo()
> R version 2.13.1 (2011-07-08)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> # Here is a simplified data set
> 
> myData <- structure(list(gender = c("F", "F", "F", "F", "F", "F", "F",
> "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M"
> ), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D", "D", "D",
> "D", "D", "D", "A", "A", "A", "B", "B", "B", "A"), id = c("1",
> "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "3",
> "3", "3", "3", "3", "3", "4")), .Names = c("gender", "mygroup",
> "id"), class = "data.frame", row.names = c(NA, -20L))
> 
> 
> # here is plyr solution with  idata.frame
> 
> library(plyr)
> imyData <-  idata.frame(myData)
> timeData <- idata.frame(ddply(imyData, .(id,mygroup), summarize,
> mytime = length(mygroup)))
> 
> makeCensor <- function(x) {
>    myvec <- rep(0,length(x))
>    lastInd <- length(myvec)
>    myvec[lastInd] = 1
>    myvec
> }
> 
> 
> plyrSolution <- ddply(timeData, "id", transform, censor = makeCensor(mygroup))
> 
> 
> # here is a data table solution
> # use makeCensor function from above
> 
> library(data.table)
> mydt <- data.table(myData)
> setkey(mydt,id,mygroup)
> 
> timeData <- mydt[,list(mytime=length(gender)),by=list(id,mygroup)]
> makeCensor <- function(x) {
>    myvec <- rep(0,length(x))
>    lastInd <- length(myvec)
>    myvec[lastInd] = 1
>    myvec
> }
> 
> mycensor <- timeData[,list(censor=makeCensor(mygroup)),by=id]
> datatableSolution <- cbind(timeData,mycensor[,list(censor)])
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.