# [R] averaging between rows with repeated data

R. Michael Weylandt michael.weylandt at gmail.com
Tue Nov 15 13:28:48 CET 2011

Oh sorry -- my mistake with ave() -- I only checked the first row....

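drop = F is an optional argument to the function "[" which tells it to
return the same kind of object it began with (a matrix stays a matrix,
a data frame stays a data frame), rather than simplifying to a vector.
F is just shorthand for FALSE.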

E.g.,

X = matrix(1:9, 3)
is.matrix(X)
# TRUE

is.matrix(X[,2:3])
# TRUE

is.matrix(X[,3])
# FALSE (just a regular vector)

is.matrix(X[,3,drop = F])
# TRUE
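
The same thing happens with a data frame, which is the case that matters for aggregate(). A small sketch, assuming a made-up data frame df (not the one from the original post):

```r
# df is a hypothetical data frame, for illustration only
df <- data.frame(numb1 = c(1, 2, 3), id = c("a", "b", "a"))

v <- df[, 2]                # a single column is simplified to a plain vector
d <- df[, 2, drop = FALSE]  # drop = FALSE keeps a one-column data frame

is.data.frame(v)  # FALSE
is.data.frame(d)  # TRUE
names(d)          # "id" (the column name is kept as well)
```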

aggregate() wants a list for its second argument (by), and data frames
are secretly also lists, so keeping the id column as a one-column data
frame gives the desired list.
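
To make that concrete, here is a minimal sketch with a made-up data frame df (the values are invented so the group means are easy to check by hand; it is not the data from the original post). Wrapping the id column in list() by hand should give the same result as using drop = FALSE:

```r
# df is a hypothetical data frame, for illustration only
df <- data.frame(numb1 = c(1, 3, 5, 7),
                 numb2 = c(2, 4, 6, 8),
                 id    = c("a", "b", "a", "b"))

is.list(df[, 3])                # FALSE: a plain vector
is.list(df[, 3, drop = FALSE])  # TRUE: a one-column data frame is a list

# Both calls average numb1 and numb2 within each id.
# Passing the bare vector df[, 3] as 'by' would stop with an error,
# because 'by' has to be a list.
res1 <- aggregate(df[, 1:2], by = df[, 3, drop = FALSE], mean)
res2 <- aggregate(df[, 1:2], by = list(id = df[, 3]), mean)

res1
#   id numb1 numb2
# 1  a     3     4
# 2  b     5     6
```

The drop = FALSE form has the small advantage that the grouping column keeps its name (id) without having to spell it out.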

Michael

On Tue, Nov 15, 2011 at 7:07 AM, Rob Griffin <robgriffin247 at hotmail.com> wrote:
> Thanks Michael,
> That second (aggregate) option worked perfectly - the first (cbind)
> generated averages for each row between the columns. (rather than between
> rows for each column).
> I came so close with aggregate yesterday - it is only slightly different to
> one of my attempts (of admittedly very many attempts) to solve it, so it feels
> good that I was going along the right lines at some point!
>
> Could you possibly explain what this drop=F term is doing?
>
> Rob
> (A very grateful and relieved phd student).
>
> (also if anyone fancies helping me with another problem I posted yesterday:
> http://r.789695.n4.nabble.com/correlations-between-columns-for-each-row-td4039193.html
> )
>
>
> -----Original Message----- From: R. Michael Weylandt
> Sent: Tuesday, November 15, 2011 12:46 PM
> To: robgriffin247
> Cc: r-help at r-project.org
> Subject: Re: [R] averaging between rows with repeated data
>
> Good morning Rob,
>
> First off, thank you for providing a reproducible example. This is one
> of those little tasks that R is pretty great at, but there exist
> \infty ways to do so and it can be a little overwhelming for the
> beginner: here's one with the base function ave():
>
> cbind(ave(example[,2:4], example[,5]), id = example[,5])
>
> This splits example according to the fifth column (id) and averages
> the other values: we then stick another copy of the id back on the end
> and are good to go.
>
> The base function aggregate can do something similar:
>
> aggregate(example[,2:4], by = example[,5, drop = F], mean)
>
> Note that you need the little-publicized but super useful drop = F
> argument to make this one work.
>
> There are other ways to do this with the plyr or doBy packages as
> well, but this should get you started.
>
> Hope it helps,
>
> Michael
>
> On Tue, Nov 15, 2011 at 5:52 AM, robgriffin247
> <robgriffin247 at hotmail.com> wrote:
>>
>> *The situation (or an example at least!)*
>>
>> example<-data.frame(rep(letters[1:10]))
>> colnames(example)[1]<-("Letters")
>> example$numb1<-rnorm(10,1,1)
>> example$numb2<-rnorm(10,1,1)
>> example$numb3<-rnorm(10,1,1)
>>
>> example$id<-c("CG234","CG232","CG441","CG128","CG125","CG182","CG232","CG441","CG232","CG125")
>>
>> *this produces something like this:*
>>  Letters     numb1      numb2        numb3    id
>> 1        a 0.8139130 -0.9775570 -0.002996244 CG234
>> 2        b 0.8268700  0.4980661  1.647717998 CG232
>> 3        c 0.2384088  1.0249684  0.120663273 CG441
>> 4        d 0.8215922  0.5686534  1.591208307 CG128
>> 5        e 0.7865918  0.5411476  0.838300185 CG125
>> 6        f 2.2385522  1.2668070  1.268005020 CG182
>> 7        g 0.7403965 -0.6224205  1.374641549 CG232
>> 8        h 0.2526634  1.0282978 -0.110449844 CG441
>> 9        i 1.9333444  1.6667486  2.937252363 CG232
>> 10       j 1.6996701  0.5964623  1.967870617 CG125
>>
>> *The Problem:*
>> Some of these ids are repeated. I want to average the values for those
>> rows within each column, but obviously they have different numbers in
>> the numb columns, and they also have different letters in the Letters
>> column. The letters are not necessary for my analysis; only the
>> duplicated ids and the numb columns are important.
>>
>> I also need to keep the existing dataframe, so I would like to build a
>> new dataframe that averages the repeated values and keeps their id. My
>> actual dataset is much more complex (271*13890), but the solution to
>> this can be expanded out to my main data set because there are just
>> more columns of numbers and still only one alphanumeric id to keep.
>>
>> In my example data, id CG232 occurs 3 times, CG441 & CG125 occur twice,
>> and everything else once. In the new dataframe (from this example)
>> there would be 3 number columns (numb1, numb2, numb3) and an id; the
>> numb column values would be the averages of the rows which had the
>> same id.
>>
>> so for example the new dataframe would contain an entry for CG125 which
>> would be something like this:
>>
>> numb1    numb2    numb3       id
>> 1.2431     0.5688     1.403         CG125
>>
>> Just as a thought, all of the IDs start with CG, so could I then use
>> grep (?) to delete CG and replace it with 0? That way duplicated ids
>> could be averaged as a number (they would be the same), but I still
>> don’t know how to produce the new dataframe with the averaged rows in
>> it...
>>
>> I hope this is clear enough! email me if you need further detail or even
>> better, if you have a solution!!
>> also sorry to be posting my second question in under 24hours but I seem to
>> have become more than a little stuck – I was making such good progress
>> with
>> R!
>>
>> Rob
>>
>> (also I'm sorry if this appears more than once on the mailing list - I'm
>> having some network & windows live issues so I'm not convinced previous
>> attempts to send this have worked, but have no way of telling if they are
>> just milling around in the internet somewhere as we speak and will decide
>> to
>> come out of hiding later!)
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/averaging-between-rows-with-repeated-data-tp4042513p4042513.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>