[R] Removing rows with earlier dates

Wed Dec 29 22:26:55 CET 2010

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Ali Salekfard
> Sent: Wednesday, December 29, 2010 6:25 AM
> To: r-help at r-project.org
> Subject: Re: [R] Removing rows with earlier dates
> 
> Thanks to everyone. Joshua's response seemed the most concise 
> one, but it
> used up so much memory that my R just gave error. I checked the other
> replies and all in all I came up with this, and thought to 
> share it with
> others and get comments.
> 
> My structure was as follows:
> 
> ACCOUNT   RULE  DATE
> A1             xxxx     2010-01-01
> A2             xxxx     2007-05-01
> A2             xxxx     2007-05-01
>  A2             xxxx     2005-05-01
> A2             xxxx     2005-05-01
>  A1             xxxx     2009-01-01

This printout is not really sufficient to tell us what
is in your dataset.  E.g., I tried to convert it to
a data.frame with the following code

my.mapping.Date <- read.table(header=TRUE,
  colClasses=c("character","character","Date"),
  textConnection("
    ACCOUNT RULE      DATE
    A1      Rule1     2010-01-01
    A2      Rule2     2007-05-01
    A2      Rule3     2007-05-01
    A2      Rule4     2005-05-01
    A2      Rule5     2005-05-01
    A1      Rule6     2009-01-01")
)

and your processing code failed in the as.Date(a,"%Y-%m-%d")
step because tapply() corrupts things of class Date (it
turns them into integers).  tapply() often has problems
dealing with nontrivial data classes.

If I read in the DATE column as character data then your
code doesn't crash.  (I did not try it with the default
factors for all columns.)

my.mapping.character <- read.table(header=TRUE,
  colClasses=c("character","character","character"),
  textConnection("
    ACCOUNT RULE      DATE
    A1      Rule1     2010-01-01
    A2      Rule2     2007-05-01
    A2      Rule3     2007-05-01
    A2      Rule4     2005-05-01
    A2      Rule5     2005-05-01
    A1      Rule6     2009-01-01")
)

f0 <- function (my.mapping) 
{
    # your code converted to a function so it doesn't
    # overwrite its input and so it can be easily compared
    # with other functions.
    a <- tapply(my.mapping$DATE, my.mapping$ACCOUNT, max)
    a <- data.frame(ACCOUNT = names(a), DT = as.Date(a, "%Y-%m-%d"))
    my.mapping <- merge(x = my.mapping, y = a, by.x = "ACCOUNT", 
        by.y = "ACCOUNT")
    my.mapping <- cbind(my.mapping, TAKE = my.mapping$DATE == 
        my.mapping$DT)
    my.mapping <- my.mapping[my.mapping$TAKE == TRUE, ]
    my.mapping
}

> f0(my.mapping.character)
  ACCOUNT  RULE       DATE         DT TAKE
1      A1 Rule1 2010-01-01 2010-01-01 TRUE
3      A2 Rule2 2007-05-01 2007-05-01 TRUE
4      A2 Rule3 2007-05-01 2007-05-01 TRUE

In your original post you wrote
  > What I would like to do is to create a data frame
  > with only the most recent  rule for each account.
but your code gives 2 rules for account A2, because
there is a tie in the dates.  Is that what you want?

It makes thinks much simpler for R-helpers if a request
for help includes details how how to make a typical
input object and exactly what is wanted to be done.

In the runs-based approach I suggested, ties are broken
by the original order of the file.  Returning all rules
for the maximum date would be more complicated using this
approach.

isLastInRun <- function (x, ...)
{
    retval <- c(x[-1] != x[-length(x)], TRUE)
    for (y in list(...)) {
        stopifnot(length(x) == length(y))
        retval <- retval | c(x[-1] != x[-length(x)], TRUE)
    }
    retval
}
f2 <- function(data) {
    o <- order(data[, "ACCOUNT"], data[, "DATE"])
    tmp <- logical(length(o))
    tmp[o] <- isLastInRun(data[o, "ACCOUNT"])
    data[tmp,]
}

f2() works on either class of DATE column.  It returns
the same class of DATE as the input class, because it
just returns a subset of the rows of the original
data.frame.  The row names/numbers in the output show
which rows of the input were selected.
> f2(my.mapping.Date)
  ACCOUNT  RULE       DATE
1      A1 Rule1 2010-01-01
3      A2 Rule3 2007-05-01
> f2(my.mapping.character)
  ACCOUNT  RULE       DATE
1      A1 Rule1 2010-01-01
3      A2 Rule3 2007-05-01

I generated a random dataset with 1 million rows (and
c. 2.3 rules/account) with

gen <- function (n, Date = FALSE) {
    set.seed(1)
    d <- data.frame(stringsAsFactors = FALSE,
        ACCOUNT = paste(sep = "", "A",
           sample(floor(n/2), size = n, replace = TRUE)),
        RULE = paste(sep = "", "Rule", 1:n),
        DATE = sprintf("%04d-%02d-%02d",
           sample(1995:2010, size = n, replace = TRUE),
           sample(1:12, size = n, replace = TRUE), 
           sample(1:28, size = n, replace = TRUE))
        )
    if (Date) {
        d$DATE <- as.Date(d$DATE)
    }
    d
}
d6 <- gen(n=10^6, Date=FALSE)

and got the following processing times

> system.time(r0 <- f0(d6))
   user  system elapsed 
  79.96    0.36   73.94 
> system.time(r2 <- f2(d6))
   user  system elapsed 
  19.81    0.02   18.18 

For n=10^5, the times were 4.47 for f0 and 1.23 for f2.

The outputs differed only in that f0 returned all rules
for the last date for an account and f2 returned only
the last rule.

If you want all rules for the last date (and don't mind
the failure if your dataset contains Date columns) then
you may prefer f0.  Otherwise you may prefer f2 for its
speed.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> 
> The most efficient solution I came across involves the 
> following steps:
> 
> 1. Find the latest date for each account, and convert it to a 
> data frame:
> 
> a<-tapply(my.mapping$DATE,my.mapping$ACCOUNT,max)
> a<-data.frame(ACCOUNT=names(a),DT=as.Date(a,"%Y-%m-%d"))
> 2. merge the set with the original data
> 
> my.mapping<-merge(x=my.mapping,y=a,by.x="ACCOUNT",by.y="ACCOUNT")
> 
> 3. Create a take column, which is to confirm if the date of 
> the row is the
> maximum date for the account.
> my.mapping<-cbind(my.mapping,TAKE=my.mapping$DATE==my.mapping$DT)
> 4. Filter out all lines except those with TAKE==TRUE.
> 
> my.mapping<-my.mapping[my.mapping$TAKE==TRUE,]
> The running time for my whole list was 4.5 sec which is far 
> better than any
> other ways I tried. Let me have your thoughts on that.
> 
> Ali
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>