[R] Removing rows with earlier dates

William Dunlap wdunlap at tibco.com
Fri Dec 24 19:44:50 CET 2010


> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Ali Salekfard
> Sent: Friday, December 24, 2010 5:46 AM
> To: r-help at r-project.org
> Subject: [R] Removing rows with earlier dates
> 
> Hi all,
> 
> I'm new to the list but have benfited from it quite 
> extensively. Straight to
> my rather strange question:
> 
> I have a data frame that contains mapping rules in this way:
> 
> ACCOUNT, RULE COLUMNS, Effective Date
> 
> 
> The dataframe comes from a database that stores all dates. 
> What I would like
> to do is to create a data frame with only the most recent 
> rule for each
> account. In traditional programming languages I would loop 
> through each
> account find the most recent rule(s) and fill up my updated 
> data frame.

Do you mean you would loop through the accounts and for
each account loop through all rules for that account looking
for the rule with the latest date?

> 
> Does anyone have any better idea to use R's magic (Its syntax is still
> magical to me) for this problem?

I like to think of this sort of a problem as a one involving
"runs" (sequences of identical data points).  The following
function identifies which data points are the last in a run:

  isLastInRun <- function (x, ...) 
  {
      retval <- c(x[-1] != x[-length(x)], TRUE)
      for (y in list(...)) {
          stopifnot(length(x) == length(y))
          retval <- retval | c(x[-1] != x[-length(x)], TRUE)
      }
      retval
  }

E.g.,
  > isLastInRun(rep(1:2,each=5), rep(1:3,c(3,3,4)))
   [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE

If your data is sorted by ACCOUNT with ties broken by
`Effective Date` then you can get what I think you want with
    f0 <- function(data) {
        data[isLastInRun(data[,"ACCOUNT"]), ]
    }
If it is not sorted then sort it first and then do the above with
    f1 <- function(data) {
        data <- data[order(data[,"ACCOUNT"], data[,"Effective Date"]), ]
        f0(data)
    }
If it is important that the output records be in the same order
as the input records then you can use
    f2 <- function(data) {
        o <- order(data[, "ACCOUNT"], data[, "Effective Date"])
        tmp <- logical(length(o))
        tmp[o] <- isLastInRun(data[o, "ACCOUNT"])
        data[tmp,]
    }
(It may be faster to sort the output of f1 rather that
sorting the input, as f2 does, but f2's method is a bit
simpler to write.)

This approach is typically faster than tapply when there
are a lot of small groups and runs less risk of corrupting
the data when using unusual classes in the columns of your
data.frame.

Typical usage is

   > d <- data.frame(check.names=FALSE,
   +       "Effective Date"=c(1004,1008,1004,1007,1005),
   +       ACCOUNT=         c(  11,  12,  11,  12,  11),
   +       Rule=            c( "A", "B", "C", "D", "E"))
   > f1(d)
     Effective Date ACCOUNT Rule
   5           1005      11    E
   2           1008      12    B
   > f2(d)
     Effective Date ACCOUNT Rule
   2           1008      12    B
   5           1005      11    E

Did I interpret your question correctly?

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> 
> By the way the list of rules is quite extensive (144643 lines to be
> precise), and there are usually 1-3 most recent rules (rows) for each
> account.
> 
> Thanks.
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 



More information about the R-help mailing list