[R] how to subset based on other row values and multiplicity

John McKown john.archie.mckown at gmail.com
Wed Jul 16 15:25:42 CEST 2014

On Wed, Jul 16, 2014 at 8:07 AM, Williams Scott
<Scott.Williams at petermac.org> wrote:
> Hi R experts,
> I have a dataset as sampled below. Values are only regarded as Œconfirmed¹
> in an individual (Œid¹) if they occur
> more than once at least 30 days apart.
> id   date value
> a    2000-01-01 x
> a    2000-03-01 x
> b    2000-11-11 w
> c    2000-11-11 y
> c    2000-10-01 y
> c    2000-09-10 y
> c    2000-12-12 z
> c    2000-10-11 z
> d    2000-11-11 w
> d    2000-11-10 w
> I wish to subset the data to retain rows where the value for the
> individual is confirmed more than 30 days apart. So, after deleting all
> rows with just one occurrence of id and value, the rest would be the
> earliest occurrence of each value in each case id, provided 31 or more
> days exist between the dates. If >1 value is present per id, each value
> level needs to be assessed independently. This example would then reduce
> to:
> id   date           value
> a    2000-01-01 x
> c    2000-09-10 y
> c    2000-10-11 z

Question: the c-y id-value pair occurs 3 times. In two cases
(2000-11-11 vs. 2000-10-01 & 2000-11-11 vs 2000-09-01) the difference
is >30 days. Why isn't
c 2000-10-01 y
also part of the result? Is it because you only want a single id-value
pair in which the date is the minimal? Or you want the one in which
the date difference is maximal? Or you overlooked that particular
match? I can't figure it out from your description.

> I can do this via some crude loops and subsetting, but I am looking for as
> much efficiency as possible
> as the dataset has around 50 million rows to assess. Any suggestions
> welcomed.

Hum, is the source of this data in a relational database such as
Oracle, PostgreSQL, MySQL, MS-SQL, or SQLite (or "other")? I ask
because some of this processing might be easier do to in the data base
using a "self join", instead of reading the entire relational table
into a data.frame and doing it in R.

> Thanks in advance
> Scott Williams MD
> Melbourne, Australia

There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown

More information about the R-help mailing list