[R] Optimize for loop / find last record for each person

William Dunlap wdunlap at tibco.com
Fri Feb 27 22:10:20 CET 2009


Andrew, it makes it easier to help if you supply a typical
input and expected output along with your code.  I tried
your code with the following input:
> history
   person_id date
1       Mary    1
2       Mary    2
3        Sue    3
4       Alex    4
5        Joe    5
6       Alex    6
7       Alex    7
8        Sue    8
9        Sue    9
10       Joe   10
made with the function
f<-function(n){
   cached.rs <- .Random.seed
   on.exit(.Random.seed<<-cached.rs)
   set.seed(1)
   data.frame(person_id=sample(c("Joe","Mary","Sue","Alex"),
        size=n,replace=TRUE), date=seq_len(n))
}
and it failed because there was no column called 'order'.

The following function, f2, does what I think you are saying
you want.  It sorts the data by person_id, breaking ties with
date, and then selects the rows where the person_id entry does
not match the person_id entry in the next row.  It then sorts
the result by date.  (I don't know if the last sort it needed
in your application.)  It should be pretty quick for long
datasets with lots of distinct person_id values.

f2 <-function (history)
{
   # assume history has, at least, columns called "person_id" and "date"
   # Return rows containing the last entry (by date) for each person.
   last <- function(x) c(x[-1]!=x[-length(x)], TRUE)
   history <- history[with(history, order(person_id,date)),,drop=FALSE]
   history <- history[last(history[,"person_id"]),,drop=FALSE]
   history[order(history$date),,drop=FALSE]
}

> f2(history)
   person_id date
2       Mary    2
7       Alex    7
9        Sue    9
10       Joe   10

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

---------------------------------------------------------------
[R] Optimize for loop / find last record for each person

Andrew Ziem ahz001 at gmail.com 
Fri Feb 27 20:02:31 CET 2009

I want to find the last record for each person_id in a data frame
(from a SQL database) ordered by date.  Is there a better way than
this for loop?

for (i in 2:length(history[,1])) {
    if (history[i, "person_id"] == history[i - 1, "person_id"])
      history[i, "order"] = history[i - 1, "order"] + 1 # same person
    else
      history[i, "order"] = 1 # new person
}

# ignore all records except the last for each con_id
history2 <- subset(history, order == 1)


Andrew




More information about the R-help mailing list