[R] identify duplicate entries in data frame and calculate mean

Tom Wright tom at maladmin.com
Tue May 24 22:23:38 CEST 2016


Don't see that as being a big problem. If your data grows then dplyr
supports connections to external databases. Alternately if you just want a
mean, most databases can do that directly in SQL.

On Tue, May 24, 2016 at 4:17 PM, Matthew <mccormack at molbio.mgh.harvard.edu>
wrote:

> Thank you very much, Tom.
> This gets me thinking in the right direction.
> One thing I should have mentioned that I did not is that the number of
> rows in the data frame will be a little over 40,000 rows.
>
>
> On 5/24/2016 4:08 PM, Tom Wright wrote:
>
> Using dplyr
>
> $ library(dplyr)
> $ x<-data.frame(Length=c(321,350,340,180,198),
>                         ID=c(rep('A234',3),'B123','B225') )
> $ x %>% group_by(ID) %>% summarise(m=mean(Length))
>
>
>
> On Tue, May 24, 2016 at 3:46 PM, Matthew <mccormack at molbio.mgh.harvard.edu
> > wrote:
>
>> I have a data frame with 10 columns.
>> In the last column is an alphaneumaric identifier.
>> For most rows, this alphaneumaric identifier is unique to the file,
>> however some of these alphanemeric idenitifiers occur in duplicate,
>> triplicate or more. When they do occur more than once they are in
>> consecutive rows, so when there is a duplicate or triplicate or
>> quadruplicate (let's call them multiplicates), they are in consecutive rows.
>>
>> In column 7 there is an integer number (may or may not be unique. does
>> not matter).
>>
>> I want to identify each multiple entries (multiplicates) occurring in
>> column 10 and then for each multiplicate calculate the mean of the integers
>> column 7.
>>
>> As an example, I will show just two columns:
>> Length  Identifier
>> 321     A234
>> 350     A234
>> 340     A234
>> 180     B123
>> 198     B225
>>
>> What I want to do (in the above example) is collapse all the A234's and
>> report the mean to get this:
>> Length  Identifier
>> 337     A234
>> 180     B123
>> 198     B225
>>
>>
>> Matthew
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>

	[[alternative HTML version deleted]]



More information about the R-help mailing list