[R] subsets

Peter Ehlers ehlers at ucalgary.ca
Thu Jan 20 14:56:47 CET 2011


On 2011-01-20 02:05, Taras Zakharko wrote:
> Hello Den,
>
> your problem is not quite what it may seem, so Ivan's suggestion is only a partial answer. I see that each patient can have
> more than one diagnosis, and I take it that you want to isolate patients based on particular combinations of conditions.
> Thus, simply looking for "ah" or "ihd", as Ivan suggests, will yield patients who have either of those but not
> necessarily patients who have both.
>
> Instead, one must apply the condition to the whole set of diagnoses associated with each patient.
> I think this is best done with the aggregate function. This function splits the data according to some
> factor (in our case it will be the patient id) and performs a routine on each subset (in our case it will be
> a condition test):
>
>
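> # Run only the line for the group you want; each assignment overwrites ids.
> ids <- aggregate(diagnosis ~ id, df, function(x) "ah" %in% x && "ihd" %in% x)     # ah and ihd
> ids <- aggregate(diagnosis ~ id, df, function(x) "ah" %in% x && !("ihd" %in% x))  # ah but no ihd
> ids <- aggregate(diagnosis ~ id, df, function(x) !("ah" %in% x) && "ihd" %in% x)  # ihd but no ah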
>
> Now, ids will contain a data frame like:
>
> id	diagnosis
> 1	TRUE
> 2	FALSE
> 3	FALSE
> ...
>
> which shows which patients have the set of diagnoses you asked for. You can then select these
> patients from the original data with something like:
>
> subset(df, id %in% subset(ids, diagnosis == TRUE)$id)
>
> This first picks the patient ids from the 'ids' data frame for which the condition holds and then extracts the associated
> rows from the original 'df' data frame.
>
> Hope it helps,
>
> Taras
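
The aggregate() approach quoted above can be tried on the sample rows from Den's original question. This is only a minimal sketch: it assumes the data frame is called df, as in the thread, and it stores each result under its own name (both, only_ah and only_ihd are illustrative names, not from the thread) so that the three results do not overwrite one another.

```r
# Sample rows transcribed from the original question (the ten rows shown there).
df <- data.frame(
  id = c(1, 2, 2, 2, 3, 3, 4, 4, 4, 5),
  diagnosis = c("ah", "ah", "ihd", "im", "ah", "stroke",
                "ah", "ihd", "angina", "ihd"),
  stringsAsFactors = FALSE
)

# One logical flag per patient for each of the three conditions.
both     <- aggregate(diagnosis ~ id, df,
                      function(x) "ah" %in% x && "ihd" %in% x)
only_ah  <- aggregate(diagnosis ~ id, df,
                      function(x) "ah" %in% x && !("ihd" %in% x))
only_ihd <- aggregate(diagnosis ~ id, df,
                      function(x) !("ah" %in% x) && "ihd" %in% x)

# Rows of the original data for patients who have both diagnoses.
df_both <- subset(df, id %in% both$id[both$diagnosis])
```

With these ten rows, patients 2 and 4 are flagged in both, patients 1 and 3 in only_ah, and patient 5 in only_ihd.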

Here's a tidy version using the plyr package:

library(plyr)
df1 <- ddply(df, .(id), summarize,
      has.both = ("ah" %in% diagnosis) & ("ihd" %in% diagnosis),
      has.only.ah = ("ah" %in% diagnosis) & !("ihd" %in% diagnosis),
      has.only.ihd = !("ah" %in% diagnosis) & ("ihd" %in% diagnosis)
)

Further processing on the columns of df1 is straightforward.
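
As one possible illustration of that further processing, the three requested data sets can be pulled out of the original data by matching ids against the flag columns. This is a sketch under assumptions: the sample rows are transcribed from the original question, df1 is rebuilt here with base R as a stand-in for the ddply() result so that the code runs without plyr, and the names by_id, has, ah_and_ihd, ah_no_ihd and ihd_no_ah are illustrative only.

```r
# Sample rows transcribed from the original question (the ten rows shown there).
df <- data.frame(
  id = c(1, 2, 2, 2, 3, 3, 4, 4, 4, 5),
  diagnosis = c("ah", "ah", "ihd", "im", "ah", "stroke",
                "ah", "ihd", "angina", "ihd"),
  stringsAsFactors = FALSE
)

# Base-R stand-in for the ddply() summary: one row of flags per patient.
by_id <- split(df$diagnosis, df$id)
has   <- function(code) vapply(by_id, function(x) code %in% x, logical(1))
df1 <- data.frame(
  id           = as.numeric(names(by_id)),
  has.both     = has("ah") & has("ihd"),
  has.only.ah  = has("ah") & !has("ihd"),
  has.only.ihd = !has("ah") & has("ihd")
)

# Keep the rows of df whose id is flagged in the corresponding column of df1.
ah_and_ihd <- df[df$id %in% df1$id[df1$has.both], ]
ah_no_ihd  <- df[df$id %in% df1$id[df1$has.only.ah], ]
ihd_no_ah  <- df[df$id %in% df1$id[df1$has.only.ihd], ]
```

The same three subsetting lines should work unchanged on the df1 produced by ddply(), since only the id column and the three logical columns are used.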

Peter Ehlers

> On Jan 20, 2011, at 9:53 , Den wrote:
>
>> Dear R people
>> Could you please help.
>>
>> Basically, there are two variables in my data set. Each patient ('id')
>> may have one or more diseases ('diagnosis'). It looks like
>>
>> id	diagnosis
>> 1	ah
>> 2	ah
>> 2	ihd
>> 2	im
>> 3	ah
>> 3	stroke
>> 4	ah
>> 4	ihd
>> 4	angina
>> 5	ihd
>> ..............
>> Q: How to make three data sets:
>> 	1. Patients with ah and ihd
>> 	2. Patients with ah but no ihd
>> 	3. Patients with  ihd but no ah?
>>
>> If you have any ideas, could you just point me to what I should look for? Is it
>> subset, or aggregate, or loops, or something else? I am a bit lost. (F1
>> F1 F1 !!!:)
>> Thank you
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.