[R] Removing rows with earlier dates

David Winsemius dwinsemius at comcast.net
Fri Dec 24 17:47:05 CET 2010


On Dec 24, 2010, at 11:04 AM, David Winsemius wrote:

>
> On Dec 24, 2010, at 8:45 AM, Ali Salekfard wrote:
>
>> Hi all,
>>
>> I'm new to the list but have benfited from it quite extensively.  
>> Straight to
>> my rather strange question:
>>
>> I have a data frame that contains mapping rules in this way:
>>
>> ACCOUNT, RULE COLUMNS, Effective Date
>>
>>
>> The dataframe comes from a database that stores all dates. What I  
>> would like
>> to do is to create a data frame with only the most recent rule for  
>> each
>> account. In traditional programming languages I would loop through  
>> each
>> account find the most recent rule(s) and fill up my updated data  
>> frame.
>>
>> Does anyone have any better idea to use R's magic (Its syntax is  
>> still
>> magical to me) for this problem?
>
> It's going to remain magic until you start thinking about what is  
> needed. In this case the need is for a good understanding of the  
> structure of the data object and the str function is the usual way  
> to examine such AND to then communicate with the list. Read the  
> Posting Guide again and the references it cites, please.
>
>>
>
> Here would have been my first attempt, assuming a dataframe named  
> dfrm:
> #make sure the most recent is on top
> dfrm <- dfrm[ order(dfrm["Effective Date"], decreasing=TRUE), ]
> # then pull the first record within ACCOUNT
> tapply(dfrm, dfrm$ACCOUNT , FUN= "[", 1 , )
>
>
>> By the way  the list of rules is quite extensive (144643 lines to be
>> precise), and there are usually 1-3 most recent rules (rows) for each
>> account.
>
> That is a bit different than the initial problem statement in which  
> you asked for the "only the most recent" within each account. How  
> are we supposed to get 3 _most_ recent rules? I think you are  
> expecting us to read your mind regarding how you are thinking about  
> this problem and pull all the records with the maximum date within  
> an account.
>
> Perhaps this effort to create a logical vector would be in the right  
> direction:
>
> dfrm[ ave(dfrm["Effective Date"], dfrm[ , "ACCOUNT"], function(x) x  
> == max(x), ]
>
> It should pull all records for which the Effective Date is equal to  
> the maximum within ACCOUNT. It is going to depend on whether  
> "Effective Date" of of a class that can be properly compared with  
> max(). Both Date and character representations of dates in standard  
> y-m-d form would qualify. Other date formats might not:
> > max("01-02-2011", "02-01-2010")
> [1] "02-01-2010"
>

When I used the strategy on the airquality dataset I do not get the  
results I expected, but a modification did succeed:

 > airquality[ airquality$Day == ave(airquality$Day, airquality$Month,  
FUN=function(x){ max(x)} ), ]
     Ozone Solar.R Wind Temp Month Day
31     37     279  7.4   76     5  31
61     NA     138  8.0   83     6  30
92     59     254  9.2   81     7  31
123    85     188  6.3   94     8  31
153    20     223 11.5   68     9  30

I do suspect it requires that the dataframe be sorted to get the  
joint  conditions lined up correctly. The earlier method should have  
used an as.logical() wrapper and would then not have needed pre- 
sorting the dataframe, so try instead:

frm[ as.logical(ave(dfrm["Effective Date"], dfrm[ , "ACCOUNT"],  
function(x) x == max(x)), ]


>
>
>
> -- 
> David Winsemius, MD
> West Hartford, CT
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT



More information about the R-help mailing list