[R] subsets

Thu Jan 20 14:29:33 CET 2011

On Thu, Jan 20, 2011 at 10:53:01AM +0200, Den wrote:
> Dear R people
> Could you please help.
> 
> Basically, there are two variables in my data set. Each patient ('id')
> may have one or more diseases ('diagnosis'). It looks like 
> 
> id	diagnosis
> 1	ah
> 2	ah
> 2	ihd
> 2	im
> 3	ah
> 3	stroke
> 4	ah
> 4	ihd
> 4	angina
> 5	ihd
> ..............
> Q: How to make three data sets:
> 	1. Patients with ah and ihd
>  	2. Patients with ah but no ihd
> 	3. Patients with  ihd but no ah?

This may be understood as a two-step procedure:
1. Split the ids into disjoint groups according to the above criteria.
2. Split the data cases into the groups from step 1.

If this is what you want, then the function table() may be used to
collect information on each id: it counts, for every id, how many
times each diagnosis occurs.

  df <- structure(list(id = c(1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L),
      diagnosis = structure(c(1L, 1L, 3L, 4L, 1L, 5L, 1L, 3L, 2L, 3L),
      .Label = c("ah", "angina", "ihd", "im", "stroke"), class = "factor")),
      .Names = c("id", "diagnosis"), class = "data.frame", row.names = c(NA, -10L))

  tab <- table(df$id, df$diagnosis) # rows: id, columns: diagnosis

Then, for example, the data cases for "2. Patients with ah but no ihd"
may be obtained as follows:

  sel <- tab[, "ah"] != 0 & tab[, "ihd"] == 0
  ah.noihd <- dimnames(tab)[[1]][sel] # [1] "1" "3"
  df[df$id %in% ah.noihd, ]
  #   id diagnosis
  # 1  1        ah
  # 5  3        ah
  # 6  3    stroke
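
The other two groups follow the same pattern. Below is a minimal sketch
that builds all three subsets at once. It assumes the same example data
as above, rebuilt here with data.frame() so that the block runs on its
own. The names has.ah, has.ihd, ids and groups are only illustrative.

```r
# Same example data as above, rebuilt so this block is self-contained.
df <- data.frame(
  id = c(1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L),
  diagnosis = factor(c("ah", "ah", "ihd", "im", "ah", "stroke",
                       "ah", "ihd", "angina", "ihd"))
)

# Rows are ids, columns are diagnoses, entries are counts.
tab <- table(df$id, df$diagnosis)

# One logical flag per id for each of the two diagnoses of interest.
has.ah  <- tab[, "ah"]  != 0
has.ihd <- tab[, "ihd"] != 0
ids <- rownames(tab)

# Step 1 selects the ids of each group, step 2 selects their data cases.
groups <- list(
  ah.ihd   = df[df$id %in% ids[has.ah & has.ihd], ],
  ah.noihd = df[df$id %in% ids[has.ah & !has.ihd], ],
  ihd.noah = df[df$id %in% ids[!has.ah & has.ihd], ]
)
```

Each element of groups is a data frame with all rows of the selected
patients, so other diagnoses of these patients (for example "stroke"
for id 3) are kept as well. Patients with neither ah nor ihd would not
appear in any of the three groups.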

I hope this helps.

Petr Savicky.