# [R] averaging between rows with repeated data

David Winsemius dwinsemius at comcast.net
Tue Nov 15 14:53:17 CET 2011

On Nov 15, 2011, at 6:46 AM, R. Michael Weylandt wrote:

> Good morning Rob,
>
> First off, thank you for providing a reproducible example. This is one
> of those little tasks that R is pretty great at, but there exist
> more than \infty ways to do so and it can be a little overwhelming for the
> beginner: here's one with the base function ave():
>
> data.frame(lapply(example[,2:4], ave, example[,5]), id = example[,5])
>
> This splits example according to the fifth column (id) and averages
> the other values: we then stick another copy of the id back on the end
> and are good to go.
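
Worth noting: ave() returns a result as long as its input, so repeated ids remain as repeated rows that carry their group mean. A minimal sketch that applies ave() one column at a time and then drops the duplicated rows with unique(); the set.seed() call and the names group_means and collapsed are my own additions for illustration, not part of the original post:

```r
set.seed(42)  # assumption: fixed seed so the sketch is repeatable
example <- data.frame(Letters = letters[1:10],
                      numb1 = rnorm(10, 1, 1),
                      numb2 = rnorm(10, 1, 1),
                      numb3 = rnorm(10, 1, 1),
                      id = c("CG234", "CG232", "CG441", "CG128", "CG125",
                             "CG182", "CG232", "CG441", "CG232", "CG125"))

# ave() on a single numeric column returns the group mean repeated for
# every row of that group, so this result still has 10 rows.
group_means <- data.frame(lapply(example[, 2:4], ave, example$id),
                          id = example$id)

# Rows sharing an id are now identical, so unique() keeps one row per id.
collapsed <- unique(group_means)
```

The original example data frame is left untouched; collapsed is the new, shorter one.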
>
> The base function aggregate can do something similar:
>
> aggregate(example[,2:4], by = example[,5, drop = FALSE], mean)
>
> Note that you need the little-publicized but super useful drop = FALSE
> argument to make this one work.

The way I usually deal with that is to wrap list() around the by=
argument; otherwise you get an error message complaining: "'by' must be
a list". (drop=FALSE has the effect of keeping data.frame columns as
lists too, so I am not disagreeing here.)

aggregate(example[,2:4], by = list(example[,5]), mean)
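
To make that concrete, here is a small sketch, assuming example is built as in the original post (the set.seed() call and the result names by_list and by_formula are mine). Naming the list element should label the grouping column id rather than Group.1, and the formula method of aggregate() is another route to the same table:

```r
set.seed(42)  # assumption: fixed seed so the sketch is repeatable
example <- data.frame(Letters = letters[1:10],
                      numb1 = rnorm(10, 1, 1),
                      numb2 = rnorm(10, 1, 1),
                      numb3 = rnorm(10, 1, 1),
                      id = c("CG234", "CG232", "CG441", "CG128", "CG125",
                             "CG182", "CG232", "CG441", "CG232", "CG125"))

# by= must be a list; naming its element labels the output column "id".
by_list <- aggregate(example[, 2:4], by = list(id = example[, 5]), FUN = mean)

# Formula interface: average each numb column within each id.
by_formula <- aggregate(cbind(numb1, numb2, numb3) ~ id,
                        data = example, FUN = mean)
```

Either call returns a new data frame with one row per id and leaves example as it was.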

--
David.

>
> There are other ways to do this with the plyr or doBy packages as
> well, but this should get you started.
>
> Hope it helps,
>
> Michael
>
> On Tue, Nov 15, 2011 at 5:52 AM, robgriffin247
> <robgriffin247 at hotmail.com> wrote:
>> *The situation (or an example at least!)*
>>
>> example<-data.frame(rep(letters[1:10]))
>> colnames(example)[1]<-("Letters")
>> example$numb1<-rnorm(10,1,1)
>> example$numb2<-rnorm(10,1,1)
>> example$numb3<-rnorm(10,1,1)
>> example$id<-c("CG234","CG232","CG441","CG128","CG125",
>>   "CG182","CG232","CG441","CG232","CG125")
>>
>> *this produces something like this:*
>>  Letters     numb1      numb2        numb3    id
>> 1        a 0.8139130 -0.9775570 -0.002996244 CG234
>> 2        b 0.8268700  0.4980661  1.647717998 CG232
>> 3        c 0.2384088  1.0249684  0.120663273 CG441
>> 4        d 0.8215922  0.5686534  1.591208307 CG128
>> 5        e 0.7865918  0.5411476  0.838300185 CG125
>> 6        f 2.2385522  1.2668070  1.268005020 CG182
>> 7        g 0.7403965 -0.6224205  1.374641549 CG232
>> 8        h 0.2526634  1.0282978 -0.110449844 CG441
>> 9        i 1.9333444  1.6667486  2.937252363 CG232
>> 10       j 1.6996701  0.5964623  1.967870617 CG125
>>
>> *The Problem:*
>> Some of these ids are repeated. I want to average the values for those
>> rows within each column, but obviously they have different numbers in
>> the numb columns, and they also have different letters in the Letters
>> column. The letters are not necessary for my analysis; only the
>> duplicated ids and the numb columns are important.
>>
>> I also need to keep the existing dataframe, so I would like to build a
>> new dataframe that averages the repeated values and keeps their id. My
>> actual dataset is much more complex (271*13890), but the solution to
>> this can be expanded out to my main data set because there are just
>> more columns of numbers and still only one alphanumeric id to keep. In
>> my example data, id CG232 occurs 3 times, CG441 & CG125 occur twice,
>> and everything else once, so in the new dataframe (from this example)
>> there would be 3 number columns (numb1, numb2, numb3) and an id. The
>> numb column values would be the averages of the rows which had the
>> same id.
>>
>> So, for example, the new dataframe would contain an entry for CG125
>> which would be something like this:
>>
>>  numb1   numb2   numb3    id
>> 1.2431  0.5688   1.403 CG125
>>
>> Just as a thought, all of the IDs start with CG, so could I then use
>> grep (?) to delete CG and replace it with 0? That way duplicated ids
>> could be averaged as a number (they would be the same), but I still
>> don’t know how to produce the new dataframe with the averaged rows in
>> it...
>>
>> I hope this is clear enough! Email me if you need further detail or,
>> even better, if you have a solution!!
>> Also, sorry to be posting my second question in under 24 hours, but I
>> seem to have become more than a little stuck - I was making such good
>> progress with R!
>>
>> Rob
>>
>> (Also, I'm sorry if this appears more than once on the mailing list -
>> I'm having some network & Windows Live issues, so I'm not convinced
>> previous attempts to send this have worked, but have no way of telling
>> if they are just milling around in the internet somewhere as we speak
>> and will decide to come out of hiding later!)
>>
>> --
>> View this message in context: http://r.789695.n4.nabble.com/averaging-between-rows-with-repeated-data-tp4042513p4042513.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> and provide commented, minimal, self-contained, reproducible code.
>>