[R] getting means by group within time point for data on multiple lines (long rather than wide file)

Thu Sep 17 13:36:16 CEST 2015

On 17/09/2015 7:06 AM, John Sorkin wrote:
> I have a long (rather than wide) file, i.e. the data for each subject are on multiple lines rather than one line. Each line has the following layout:
> subject group time value
> I have two groups and multiple subjects; each subject can be seen up to three times at time 0, and at most once at times 4 and 8.
> An example of the data follows:
> 
> 1 control 0 100
> 1 control 0 NA
> 1 control 0 55
> 1 control 4 100
> 1 control 8 100
> 
> 2 exp 0 99
> 2 exp 0 67
> 2 exp 0 66
> 2 exp 4 110
> 2 exp 8 200
> 
> I need to get means by group (control vs. exp) within time (0, 4, 8). The means should include only those subjects who have at least one observation at each time point (0, 4, 8). I also need to determine the number of subjects who contribute data at each time point by group. Any suggestion on how to get the means would be appreciated. Sad to say I worked on this for four hours last night without coming to any understanding of how this can be done. UGG!

Do it in two stages.  First, group the data by subject id, and delete
any subjects that don't have sufficient observations.  Then group by
treatment and time and take means.

The tapply() or by() functions will be useful for both of these steps.
For example,

# x is the long data frame with columns subject, group, time, value
do.call(rbind,
  by(x, x$subject,
     function(sub)
       if (length(unique(sub$time)) == 3) sub
       else NULL))

will remove any subject who does not have exactly 3 distinct observed
times.  (It doesn't take NA into account: a row whose value is NA still
counts as an observed time.  If you need to handle that, you'll need to
make that function(sub) more complicated.  "sub" will be a dataframe
containing data for just one subject.)
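
One way the NA-aware version might look, as a minimal sketch.  It
assumes the columns are named subject, group, time and value (the
layout in the question), and that "observed" means at least one
non-missing value at each of the times 0, 4 and 8.  Subject 3 and the
helper name complete_subject are made up for illustration.

```r
# Toy data: subjects 1 and 2 are from the question, subject 3 is invented
# (its only value at time 4 is NA, so it should be dropped).
x <- data.frame(
  subject = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  group   = c(rep("control", 5), rep("exp", 5), rep("control", 3)),
  time    = c(0, 0, 0, 4, 8, 0, 0, 0, 4, 8, 0, 4, 8),
  value   = c(100, NA, 55, 100, 100, 99, 67, 66, 110, 200, 80, NA, 90)
)

# Assumption: these are the time points every retained subject must cover.
required_times <- c(0, 4, 8)

# Return the subject's rows if every required time has at least one
# non-missing value, otherwise NULL (so rbind drops the subject).
complete_subject <- function(sub) {
  observed <- unique(sub$time[!is.na(sub$value)])
  if (all(required_times %in% observed)) sub else NULL
}

kept <- do.call(rbind, by(x, x$subject, complete_subject))
```

Here "kept" holds only the rows for subjects 1 and 2; the NA row for
subject 1 at time 0 is still there, so later means need na.rm = TRUE.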

The "do.call(rbind, ...)" call puts the list output from by() back
together as a single dataframe.
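
Both stages can also be done with tapply().  The following is only a
sketch under some assumptions: the column names subject, group, time
and value are taken from the layout in the question, subjects 3 and 4
are invented, and the means pool all non-missing observations in a
group/time cell (so a subject seen three times at time 0 contributes
three values there).  If each subject should count once, average within
subject first.

```r
# Toy data: subjects 1 and 2 are from the question, 3 and 4 are invented.
x <- data.frame(
  subject = c(rep(1, 5), rep(2, 5), rep(3, 3), rep(4, 3)),
  group   = c(rep("control", 5), rep("exp", 5), rep("control", 6)),
  time    = c(0, 0, 0, 4, 8, 0, 0, 0, 4, 8, 0, 4, 8, 0, 4, 8),
  value   = c(100, NA, 55, 100, 100, 99, 67, 66, 110, 200,
              80, NA, 90, 90, 95, 105)
)

# Stage 1: count distinct times with a non-missing value per subject,
# and keep only subjects that have all three.
ok <- !is.na(x$value)
n_times <- tapply(x$time[ok], x$subject[ok],
                  function(t) length(unique(t)))
keep_ids <- names(n_times)[n_times == 3]
kept <- x[as.character(x$subject) %in% keep_ids, ]

# Stage 2: means by group (rows) within time (columns).
means <- tapply(kept$value, list(kept$group, kept$time),
                mean, na.rm = TRUE)

# Number of distinct subjects contributing a non-missing value per cell.
ok_kept <- !is.na(kept$value)
n_subjects <- tapply(kept$subject[ok_kept],
                     list(kept$group[ok_kept], kept$time[ok_kept]),
                     function(s) length(unique(s)))
```

Both "means" and "n_subjects" are matrices with one row per group and
one column per time, so they can be printed side by side or indexed as
means["control", "4"].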

Duncan Murdoch