[R] Selecting one row or multiple rows per ID

hadley wickham h.wickham at gmail.com
Wed Mar 4 15:55:30 CET 2009


On Wed, Mar 4, 2009 at 12:09 AM, Vedula, Satyanarayana
<svedula at jhsph.edu> wrote:
> Hi,
>
> Could someone help with coding this in R?
>
> I need to select one row per patient i in clinic j. The data are organized as shown below.
>
> Two columns, patient i and clinic j, together identify each unique patient. There are two outcome columns. Some patients have multiple rows, each representing one visit, coded in the Visit column. Others have a single row with data from one visit.
>
> I need to select one row per patient i in clinic j using the following algorithm:
>
> If patient has outcome recorded at visit 2, then outcome = outcome columns at visit 2
> If patient does not have visit 2, then outcome = outcome at visit 5
> If patient does not have visit 2 and visit 5, then outcome = outcome at visit 4
> If patient does not have visits 2, 5, and 4, then outcome = outcome at visit 3
> If patient does not have visits 2, 5, 4, and 3, then outcome = outcome at visit 1
> If patient does not have any of the visits, outcome = missing
>
>
> Patient    Clinic    Visit    Outcome_left  Outcome_right
> patient 1  clinic 1  visit 2  22            21
> patient 1  clinic 3  visit 1  21            21
> patient 1  clinic 3  visit 2  21            22
> patient 1  clinic 3  visit 3  20            22
> patient 3  clinic 5  visit 1  24            21
> patient 3  clinic 5  visit 3  21            22
> patient 3  clinic 5  visit 4  22            23
> patient 3  clinic 5  visit 5  22            22
>
> I need to select just the first row for patient 1/clinic 1; the second row (visit 2) for patient 1/clinic 3; and the fourth row (visit 5) for patient 3/clinic 5.

I'd approach this problem in the following way:

# Read the example data from an inline string; the text argument avoids
# opening (and later having to close) a connection by hand
df <- read.csv(text = "
Patient,Clinic,Visit,Outcome_left,Outcome_right
patient 1,clinic 1,visit 2,22,21
patient 1,clinic 3,visit 1,21,21
patient 1,clinic 3,visit 2,21,22
patient 1,clinic 3,visit 3,20,22
patient 3,clinic 5,visit 1,24,21
patient 3,clinic 5,visit 3,21,22
patient 3,clinic 5,visit 4,22,23
patient 3,clinic 5,visit 5,22,22
", header = TRUE)


# With a single patient it's pretty easy to find the preferred visit
preferred_visit <- paste("visit", c(2, 5, 4, 3, 1))

one <- subset(df, Patient == "patient 3" & Clinic == "clinic 5")
best_visit <- na.omit(match(preferred_visit, one$Visit))[1]
one[best_visit, ]
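
To see why this works, it may help to look at the intermediate values for patient 3/clinic 5. The snippet below is a small illustrative sketch; the `visits` vector is typed in by hand to stand in for `one$Visit` and is not part of the original code.

```r
# Preference order from the question: visit 2, then 5, 4, 3, 1
preferred_visit <- paste("visit", c(2, 5, 4, 3, 1))

# Hand-typed stand-in for one$Visit for patient 3 / clinic 5
visits <- c("visit 1", "visit 3", "visit 4", "visit 5")

# For each preferred visit, the row where it occurs (NA if it is absent)
pos <- match(preferred_visit, visits)
pos
# NA  4  3  2  1

# Dropping the NAs and taking the first element gives the row of the
# most preferred visit that is actually present
best <- na.omit(pos)[1]
visits[best]
# "visit 5"
```

Note that if none of the preferred visits are present, `na.omit(pos)[1]` is `NA`, and indexing a data frame with it returns a row of `NA`s.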

# We then turn this into a function
find_best_visit <- function(one) {
  best_visit <- na.omit(match(preferred_visit, one$Visit))[1]
  one[best_visit, ]
}

# Then apply it to every combination of patient and clinic with plyr
library(plyr)
ddply(df, .(Patient, Clinic), find_best_visit)

# You can learn more about plyr at http://had.co.nz/plyr
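
If plyr is not available, roughly the same result can be obtained in base R. The sketch below is an alternative technique, not the approach above: it ranks every row by its position in the preference order, sorts, and keeps the first row per Patient/Clinic. The helper column `rank` and the result name `best` are introduced here for illustration only, and blanking the outcomes for groups with no listed visit is one assumed reading of the "outcome = missing" rule.

```r
df <- read.csv(text = "Patient,Clinic,Visit,Outcome_left,Outcome_right
patient 1,clinic 1,visit 2,22,21
patient 1,clinic 3,visit 1,21,21
patient 1,clinic 3,visit 2,21,22
patient 1,clinic 3,visit 3,20,22
patient 3,clinic 5,visit 1,24,21
patient 3,clinic 5,visit 3,21,22
patient 3,clinic 5,visit 4,22,23
patient 3,clinic 5,visit 5,22,22", stringsAsFactors = FALSE)

preferred_visit <- paste("visit", c(2, 5, 4, 3, 1))

# Rank of each row's visit in the preference order (1 = most preferred,
# NA = visit not in the list); 'rank' is a temporary helper column
df$rank <- match(df$Visit, preferred_visit)

# Sort so the most preferred visit comes first within each patient/clinic;
# order() places NA ranks last by default
ordered <- df[order(df$Patient, df$Clinic, df$rank), ]

# Keep only the first row of each patient/clinic combination
best <- ordered[!duplicated(ordered[c("Patient", "Clinic")]), ]

# Assumption: a group with none of the listed visits gets missing outcomes
best[is.na(best$rank), c("Outcome_left", "Outcome_right")] <- NA

# Tidy up the helper column and row names
best$rank <- NULL
rownames(best) <- NULL
best
```

For the example data this selects visit 2 for patient 1/clinic 1, visit 2 for patient 1/clinic 3, and visit 5 for patient 3/clinic 5, matching the rows asked for in the question.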


Hadley

-- 
http://had.co.nz/



