[R] averaging between rows with repeated data

David Winsemius dwinsemius at comcast.net
Tue Nov 15 14:53:17 CET 2011


On Nov 15, 2011, at 6:46 AM, R. Michael Weylandt wrote:

> Good morning Rob,
>
> First off, thank you for providing a reproducible example. This is one
> of those little tasks that R is pretty great at, but there exist more
> than \infty ways to do so and it can be a little overwhelming for the
> beginner: here's one with the base function ave():
>
> cbind(ave(example[,2:4], example[,5]), id = example[,5])
>
> This splits example according to the fifth column (id) and averages
> the other values: we then stick another copy of the id back on the end
> and are good to go.
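
One thing worth keeping in mind: ave() returns a result of the same length as its input, so each group mean is repeated on every row of its group rather than collapsed to one row per id. A minimal sketch on a small made-up data frame (the names toy, means and collapsed are illustrative, not from the thread), applying ave() one column at a time and then dropping the duplicated ids:

```r
# Toy data (hypothetical): two numeric columns and a repeated id
toy <- data.frame(numb1 = c(1, 3, 5, 7),
                  numb2 = c(2, 4, 6, 8),
                  id    = c("A", "B", "A", "B"),
                  stringsAsFactors = FALSE)

# ave() is applied to one vector at a time here; each element is
# replaced by the mean of its id group
means <- as.data.frame(lapply(toy[, c("numb1", "numb2")],
                              function(col) ave(col, toy$id, FUN = mean)))
means$id <- toy$id

# means still has four rows; keep the first row of each id to collapse
collapsed <- means[!duplicated(means$id), ]
rownames(collapsed) <- NULL
```

If one row per id is the goal, aggregate() gets there in a single step; the ave() form is mostly useful when the group mean is wanted alongside the original rows.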
>
> The base function aggregate can do something similar:
>
> aggregate(example[,2:4], by = example[,5, drop = FALSE], mean)
>
> Note that you need the little-publicized but super useful drop = FALSE
> argument to make this one work.

The way I usually deal with that is to wrap list() around the by=
argument, since I usually forget about this aggregate quirk and
get an error message complaining: "'by' must be a list". (drop=FALSE
keeps a single-column subset as a data.frame, which is itself a list,
so I am not disagreeing here.)

aggregate(example[,2:4], by = list(example[,5]), mean)
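
If it helps, naming the list element gives the grouping column a proper name in the result instead of the default Group.1, and the formula interface avoids the by= argument altogether. A sketch on a small made-up data frame (toy and the result names are hypothetical, chosen only for illustration):

```r
# Toy data (hypothetical), shaped like the example in the question
toy <- data.frame(Letters = letters[1:4],
                  numb1 = c(1, 3, 5, 7),
                  numb2 = c(2, 4, 6, 8),
                  id    = c("A", "B", "A", "B"),
                  stringsAsFactors = FALSE)

num_cols <- c("numb1", "numb2")

# Named list: the grouping column in the result is called "id"
by_list <- aggregate(toy[, num_cols], by = list(id = toy$id), FUN = mean)

# drop = FALSE keeps the id column as a one-column data.frame (a list)
by_drop <- aggregate(toy[, num_cols], by = toy[, "id", drop = FALSE], FUN = mean)

# Formula interface: "." stands for every column of data other than id
by_formula <- aggregate(. ~ id, data = toy[, c(num_cols, "id")], FUN = mean)
```

All three give one row per id with the column means. Note that the formula interface drops rows containing NA by default, which the by= forms do not.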

-- 
David.


>
> There are other ways to do this with the plyr or doBy packages as
> well, but this should get you started.
>
> Hope it helps,
>
> Michael
>
> On Tue, Nov 15, 2011 at 5:52 AM, robgriffin247
> <robgriffin247 at hotmail.com> wrote:
>> *The situation (or an example at least!)*
>>
>> example<-data.frame(rep(letters[1:10]))
>> colnames(example)[1]<-("Letters")
>> example$numb1<-rnorm(10,1,1)
>> example$numb2<-rnorm(10,1,1)
>> example$numb3<-rnorm(10,1,1)
>> example$id<-c("CG234","CG232","CG441","CG128","CG125",
>>               "CG182","CG232","CG441","CG232","CG125")
>>
>> *this produces something like this:*
>>  Letters     numb1      numb2        numb3    id
>> 1        a 0.8139130 -0.9775570 -0.002996244 CG234
>> 2        b 0.8268700  0.4980661  1.647717998 CG232
>> 3        c 0.2384088  1.0249684  0.120663273 CG441
>> 4        d 0.8215922  0.5686534  1.591208307 CG128
>> 5        e 0.7865918  0.5411476  0.838300185 CG125
>> 6        f 2.2385522  1.2668070  1.268005020 CG182
>> 7        g 0.7403965 -0.6224205  1.374641549 CG232
>> 8        h 0.2526634  1.0282978 -0.110449844 CG441
>> 9        i 1.9333444  1.6667486  2.937252363 CG232
>> 10       j 1.6996701  0.5964623  1.967870617 CG125
>>
>> *The Problem:*
>> Some of these ids are repeated. I want to average the values for those
>> rows within each column, but they have different numbers in the numb
>> columns and different letters in the Letters column. The letters are
>> not necessary for my analysis; only the duplicated ids and the numb
>> columns are important.
>>
>> I also need to keep the existing dataframe, so I would like to build a
>> new dataframe that averages the repeated values and keeps their id. My
>> actual dataset is much more complex (271*13890), but the solution to
>> this can be expanded to my main data set because there are just more
>> columns of numbers and still only one alphanumeric id to keep.
>>
>> In my example data, id CG232 occurs 3 times, CG441 and CG125 occur
>> twice, and everything else once. So the new dataframe (from this
>> example) would have 3 number columns (numb1, numb2, numb3) and an id;
>> the numb column values would be the averages of the rows which had the
>> same id.
>>
>> So, for example, the new dataframe would contain an entry for CG125,
>> which would be something like this:
>>
>>  numb1  numb2  numb3    id
>> 1.2431 0.5688  1.403 CG125
>>
>> Just as a thought, all of the IDs start with CG, so could I then use
>> grep (?) to delete CG and replace it with 0? That way duplicated ids
>> could be averaged as a number (they would be the same), but I still
>> don’t know how to produce the new dataframe with the averaged rows in
>> it...
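
On the grep idea: there should be no need to turn the ids into numbers, since grouping works on character (or factor) ids directly. If you did want to strip the prefix, sub() is the function that replaces; grep() only reports which elements match. A small sketch with made-up ids (the vector ids is hypothetical):

```r
ids <- c("CG234", "CG232", "CG232")

# grep() returns the positions of the matching elements; it replaces nothing
matches <- grep("^CG", ids)

# sub() replaces the first match in each element, here the leading "CG"
stripped <- sub("^CG", "", ids)
as_number <- as.numeric(stripped)

# Character ids can be used for grouping as they are
counts <- table(ids)
```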
>>
>> I hope this is clear enough! Email me if you need further detail or,
>> even better, if you have a solution!!
>> Also, sorry to be posting my second question in under 24 hours, but I
>> seem to have become more than a little stuck. I was making such good
>> progress with R!
>>
>> Rob
>>
>> (Also, I'm sorry if this appears more than once on the mailing list.
>> I'm having some network & windows live issues, so I'm not convinced
>> previous attempts to send this have worked, but I have no way of
>> telling if they are just milling around in the internet somewhere as
>> we speak and will decide to come out of hiding later!)
>>
>> --
>> View this message in context: http://r.789695.n4.nabble.com/averaging-between-rows-with-repeated-data-tp4042513p4042513.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>

David Winsemius, MD
West Hartford, CT