[R] Grouping data in a data frame: is there an efficient way to do it?

Bert Gunter gunter.berton at gene.com
Thu Sep 3 01:12:18 CEST 2009


table() and xtabs() are fast only because they just do counts. For the
general case you need ?tapply. aggregate() is basically a wrapper for
lapply, so tapply may show the same performance issues; try it and see.
Both essentially implement, in R code, the sort of hash table you describe.
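
As a rough sketch of the difference between counting and the general case:
the data frame below is made up for illustration (my.df and my.field follow
the names in the original question; the value column is an assumption).
table() only counts, while tapply() applies any function per group.

```r
# Hypothetical stand-in for the poster's data; 'value' is an assumed column.
set.seed(1)
my.df <- data.frame(
  my.field = sample(letters[1:5], 1000, replace = TRUE),
  value    = runif(1000)
)

# Counting only: table() tabulates the grouping column directly.
counts_table <- table(my.df$my.field)

# General case: tapply() applies an arbitrary function to each group.
counts_tapply <- tapply(my.df$value, my.df$my.field, length)  # group sizes
means_tapply  <- tapply(my.df$value, my.df$my.field, mean)    # group means
```

Wrapping either call in system.time() on the real data is the simplest way
to see whether tapply is fast enough for a given problem.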

Whether Perl or Python would be faster I cannot say -- but are you including
the time required to develop and debug the script in your assessment?

Bert Gunter
Genentech Nonclinical Biostatistics


-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of David Winsemius
Sent: Wednesday, September 02, 2009 3:59 PM
To: Leo Alekseyev
Cc: r-help at r-project.org
Subject: Re: [R] Grouping data in a data frame: is there an efficient way
to do it?

table is reasonably fast. I have more than 4 x 10^6 records and a 2D
table takes very little time:

  # both URwbc and URrbc are factors
  nUA <- with(TRdta, table(URwbc, URrbc))
  nUA

This does the same thing and took about 5 seconds just now:

xtabs( ~ URwbc + URrbc, data=TRdta)
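
For anyone without a data set like TRdta at hand, here is a small sketch of
the same comparison on made-up data. The data frame toy and its factor
columns f1 and f2 are hypothetical; the point is only that the two calls
should give the same cell counts.

```r
# Hypothetical two-factor data frame, standing in for TRdta.
set.seed(42)
toy <- data.frame(
  f1 = factor(sample(c("neg", "trace", "pos"), 500, replace = TRUE)),
  f2 = factor(sample(c("low", "high"), 500, replace = TRUE))
)

# Two-way table of counts via table() ...
tab_table <- with(toy, table(f1, f2))

# ... and the same cross-tabulation via the formula interface of xtabs().
tab_xtabs <- xtabs(~ f1 + f2, data = toy)

# The cell counts should agree element by element.
same_counts <- all(as.vector(tab_table) == as.vector(tab_xtabs))
```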

On Sep 2, 2009, at 6:39 PM, Leo Alekseyev wrote:

> I have a data frame with about 10^6 rows; I want to group the data
> according to entries in one of the columns and do something with it.
> For instance, suppose I want to count up the number of elements in
> each group.  I tried something like aggregate(my.df$my.field,
> list(my.df$my.field), length) but it seems to be very slow.  Likewise,
> the split() function was slow (I killed it before it completed).  Is
> there a way to accomplish this efficiently in R?  I am almost
> tempted to write an external Perl/Python script that enters every row
> into a hash table keyed by my.field and iterates over the keys.
> Might this be faster?



David Winsemius, MD
Heritage Laboratories
West Hartford, CT




