[R] how to subset based on other row values and multiplicity

John McKown john.archie.mckown at gmail.com
Wed Jul 16 16:01:21 CEST 2014


Thanks. So you only want a single entry with a given "id" & "value",
even if there are multiple possible confirmations.

Too bad about not being in an SQL data base. I've already partially
solved the problem using PostgreSQL. Just in case you, or others,
might be interested, below is a transcript of what I have. The SQL
might suggest a possible approach in native R.

<transcript>
tsh009=# select * from datedata;
 id |    date    | value
----+------------+-------
 a  | 2000-01-01 | x
 a  | 2000-03-01 | x
 b  | 2000-11-11 | w
 c  | 2000-11-11 | y
 c  | 2000-10-01 | y
 c  | 2000-09-10 | y
 c  | 2000-12-12 | z
 c  | 2000-10-11 | z
 d  | 2000-11-11 | w
 d  | 2000-11-10 | w
(10 rows)

tsh009=# select a.id,a.date,a.value
from datedata as a
join datedata as b
on a.id = b.id and a.value=b.value
where b.date - a.date > 30;
 id |    date    | value
----+------------+-------
 a  | 2000-01-01 | x
 c  | 2000-10-01 | y
 c  | 2000-09-10 | y
 c  | 2000-10-11 | z
(4 rows)

</transcript>

the only problem is the "multiple confirmation" problem because you
only want / need a single c-y confirmation and my code produces all
possible ones.

On Wed, Jul 16, 2014 at 8:38 AM, Williams Scott
<Scott.Williams at petermac.org> wrote:
> It probably isn’t that clear John - to put it another way - each patient
> (‘id’) can have multiple diagnosis codes (w -> z in this example, several
> thousand in reality) recorded at multiple times. I just need to find the
> ‘confirmed’ diagnosis code or codes for each patient. To be confirmed they
> have to occur at least twice and at least a month apart. So patient c has
> 2 diagnoses recorded, 1 recorded twice and one thrice; each confirmed by
> multiplicity and time.
>
> The data is delivered as a flat .txt file. I’m not proficient with any
> databases other than MS Access unfortunately, and the 120Gb of data is not
> easily managed in Access.
>
> I hope that helps
> S
>
> On 16/07/2014 11:25 pm, "John McKown" <john.archie.mckown at gmail.com> wrote:
>
>>On Wed, Jul 16, 2014 at 8:07 AM, Williams Scott
>><Scott.Williams at petermac.org> wrote:
>>> Hi R experts,
>>>
>>> I have a dataset as sampled below. Values are only regarded as
>>>Œconfirmed¹
>>> in an individual (Œid¹) if they occur
>>> more than once at least 30 days apart.
>>>
>>>
>>> id   date value
>>> a    2000-01-01 x
>>> a    2000-03-01 x
>>> b    2000-11-11 w
>>> c    2000-11-11 y
>>> c    2000-10-01 y
>>> c    2000-09-10 y
>>> c    2000-12-12 z
>>> c    2000-10-11 z
>>> d    2000-11-11 w
>>> d    2000-11-10 w
>>>
>>>
>>> I wish to subset the data to retain rows where the value for the
>>> individual is confirmed more than 30 days apart. So, after deleting all
>>> rows with just one occurrence of id and value, the rest would be the
>>> earliest occurrence of each value in each case id, provided 31 or more
>>> days exist between the dates. If >1 value is present per id, each value
>>> level needs to be assessed independently. This example would then reduce
>>> to:
>>>
>>>
>>> id   date           value
>>> a    2000-01-01 x
>>> c    2000-09-10 y
>>> c    2000-10-11 z
>>
>>Question: the c-y id-value pair occurs 3 times. In two cases
>>(2000-11-11 vs. 2000-10-01 & 2000-11-11 vs 2000-09-01) the difference
>>is >30 days. Why isn't
>>c 2000-10-01 y
>>also part of the result? Is it because you only want a single id-value
>>pair in which the date is the minimal? Or you want the one in which
>>the date difference is maximal? Or you overlooked that particular
>>match? I can't figure it out from your description.
>>
>>>
>>>
>>>
>>> I can do this via some crude loops and subsetting, but I am looking for
>>>as
>>> much efficiency as possible
>>> as the dataset has around 50 million rows to assess. Any suggestions
>>> welcomed.
>>
>>Hum, is the source of this data in a relational database such as
>>Oracle, PostgreSQL, MySQL, MS-SQL, or SQLite (or "other")? I ask
>>because some of this processing might be easier do to in the data base
>>using a "self join", instead of reading the entire relational table
>>into a data.frame and doing it in R.
>>
>>>
>>> Thanks in advance
>>>
>>> Scott Williams MD
>>> Melbourne, Australia
>>>
>>
>>--
>>There is nothing more pleasant than traveling and meeting new people!
>>Genghis Khan
>>
>>Maranatha! <><
>>John McKown
>
> This email (including any attachments or links) may contain
> confidential and/or legally privileged information and is
> intended only to be read or used by the addressee.  If you
> are not the intended addressee, any use, distribution,
> disclosure or copying of this email is strictly
> prohibited.
> Confidentiality and legal privilege attached to this email
> (including any attachments) are not waived or lost by
> reason of its mistaken delivery to you.
> If you have received this email in error, please delete it
> and notify us immediately by telephone or email.  Peter
> MacCallum Cancer Centre provides no guarantee that this
> transmission is free of virus or that it has not been
> intercepted or altered and will not be liable for any delay
> in its receipt.
>



-- 
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown



More information about the R-help mailing list