[R] Removing rows with earlier dates

Martin Maechler maechler at stat.math.ethz.ch
Wed Dec 29 15:39:08 CET 2010


>>>>> David Winsemius <dwinsemius at comcast.net>
>>>>>     on Fri, 24 Dec 2010 11:47:05 -0500 writes:

    > On Dec 24, 2010, at 11:04 AM, David Winsemius wrote:

    >> 
    >> On Dec 24, 2010, at 8:45 AM, Ali Salekfard wrote:
    >> 
    >>> Hi all,
    >>> 
    >>> I'm new to the list but have benfited from it quite extensively.  
    >>> Straight to
    >>> my rather strange question:
    >>> 
    >>> I have a data frame that contains mapping rules in this way:
    >>> 
    >>> ACCOUNT, RULE COLUMNS, Effective Date
    >>> 
    >>> 
    >>> The dataframe comes from a database that stores all dates. What I  
    >>> would like
    >>> to do is to create a data frame with only the most recent rule for  
    >>> each
    >>> account. In traditional programming languages I would loop through  
    >>> each
    >>> account find the most recent rule(s) and fill up my updated data  
    >>> frame.
    >>> 
    >>> Does anyone have any better idea to use R's magic (Its syntax is  
    >>> still
    >>> magical to me) for this problem?
    >> 
    >> It's going to remain magic until you start thinking about what is  
    >> needed. In this case the need is for a good understanding of the  
    >> structure of the data object and the str function is the usual way  
    >> to examine such AND to then communicate with the list. Read the  
    >> Posting Guide again and the references it cites, please.
    >> 
    >>> 
    >> 
    >> Here would have been my first attempt, assuming a dataframe named  
    >> dfrm:
    >> #make sure the most recent is on top
    >> dfrm <- dfrm[ order(dfrm["Effective Date"], decreasing=TRUE), ]
    >> # then pull the first record within ACCOUNT
    >> tapply(dfrm, dfrm$ACCOUNT , FUN= "[", 1 , )
    >> 
    >> 
    >>> By the way  the list of rules is quite extensive (144643 lines to be
    >>> precise), and there are usually 1-3 most recent rules (rows) for each
    >>> account.
    >> 
    >> That is a bit different than the initial problem statement in which  
    >> you asked for the "only the most recent" within each account. How  
    >> are we supposed to get 3 _most_ recent rules? I think you are  
    >> expecting us to read your mind regarding how you are thinking about  
    >> this problem and pull all the records with the maximum date within  
    >> an account.
    >> 
    >> Perhaps this effort to create a logical vector would be in the right  
    >> direction:
    >> 
    >> dfrm[ ave(dfrm["Effective Date"], dfrm[ , "ACCOUNT"], function(x) x  
    >> == max(x), ]
    >> 
    >> It should pull all records for which the Effective Date is equal to  
    >> the maximum within ACCOUNT. It is going to depend on whether  
    >> "Effective Date" of of a class that can be properly compared with  
    >> max(). Both Date and character representations of dates in standard  
    >> y-m-d form would qualify. Other date formats might not:
    >> > max("01-02-2011", "02-01-2010")
    >> [1] "02-01-2010"
    >> 

    > When I used the strategy on the airquality dataset I do not get the  
    > results I expected, but a modification did succeed:

    >> airquality[ airquality$Day == ave(airquality$Day, airquality$Month,  
    > FUN=function(x){ max(x)} ), ]
    > Ozone Solar.R Wind Temp Month Day
    > 31     37     279  7.4   76     5  31
    > 61     NA     138  8.0   83     6  30
    > 92     59     254  9.2   81     7  31
    > 123    85     188  6.3   94     8  31
    > 153    20     223 11.5   68     9  30

Hmm, yes, but   " FUN = function(x) { max(x) } "
is so ugly that it hurts my R-eyes.
Just use  'FUN = max'  .. please ..

and as we are in making things more readable,
I'd like to propose using with() in these cases -->

 > airquality[with(airquality, Day == ave(Day, Month, FUN=max)),]

     Ozone Solar.R Wind Temp Month Day
 31     37     279  7.4   76     5  31
 61     NA     138  8.0   83     6  30
 92     59     254  9.2   81     7  31
 123    85     188  6.3   94     8  31
 153    20     223 11.5   68     9  30


Regards,
Martin Maechler, ETH Zurich



    > I do suspect it requires that the dataframe be sorted to get the  
    > joint  conditions lined up correctly. The earlier method should have  
    > used an as.logical() wrapper and would then not have needed pre- 
    > sorting the dataframe, so try instead:

    > frm[ as.logical(ave(dfrm["Effective Date"], dfrm[ , "ACCOUNT"],  
    > function(x) x == max(x)), ]


    >> 
    >> 
    >> -- 
    >> David Winsemius, MD
    >> West Hartford, CT



More information about the R-help mailing list