[R] summarizing a data frame i.e. count -> group by

Sun Oct 23 20:05:21 CEST 2011

This could be done with aggregate, but I am unfamiliar with it, so I'll give what I think you want from your message using the 'reshape' package, which you'll have to download.  If your problem is large, the data.table package would be much faster.
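
For reference, a base-R sketch of the aggregate route might look like the following. It assumes the data frame is rebuilt from the sample rows in the quoted message, and the column name "n" is just a label chosen here for the count; treat it as an illustration rather than a tested answer for the real data.

```r
# Rebuild the sample data frame from the quoted message (15 rows).
df <- data.frame(
  time = rep(1, 15),
  partitioning_mode = rep(c("sharding", "replication"), times = c(11, 4)),
  workload = rep(c("query", "refresh", "query"), times = c(7, 4, 4)),
  runtime = c(607, 85, 52, 79, 77, 67, 98,
              2932, 2870, 2877, 2868,
              2891, 2907, 2922, 2937)
)

# Count rows per (time, partitioning_mode); length() stands in for SQL's count(*).
throughput <- aggregate(runtime ~ time + partitioning_mode, data = df, FUN = length)

# The count lands in the "runtime" column; rename it ("n" is an arbitrary label).
names(throughput)[3] <- "n"
print(throughput)
```

The formula interface groups by the variables on the right-hand side and applies FUN to the variable on the left, which is roughly the shape of the SQL group by in the question.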

You haven't really said what you'd like to get from the output, so I'm going by what your code looks like you want. There is no count function in base R; the function is called length (you may want sum, but it does not appear that way).  Also, giving the list a bit of what you'd expect for output is often helpful.
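
To illustrate the point about length, here is a small sketch using only base R. It assumes the sample rows from the quoted message; the names counts, tab and n are chosen here for illustration.

```r
# Sample data from the quoted message, reduced to the columns needed here.
df <- data.frame(
  time = rep(1, 15),
  partitioning_mode = rep(c("sharding", "replication"), times = c(11, 4)),
  runtime = c(607, 85, 52, 79, 77, 67, 98,
              2932, 2870, 2877, 2868,
              2891, 2907, 2922, 2937)
)

# tapply with length gives a time x partitioning_mode matrix of counts.
counts <- tapply(df$runtime, list(df$time, df$partitioning_mode), length)
print(counts)

# table() counts combinations directly; as.data.frame() turns it into long form.
tab <- as.data.frame(
  table(time = df$time, partitioning_mode = df$partitioning_mode),
  responseName = "n"
)
print(tab)
```

Swapping length for sum in the tapply call would total the runtimes per group instead of counting rows.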

Here is the code (one of these three options is what you want, I think):

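library(reshape)
# Rows: time; columns: partitioning_mode; cells: number of runtime values.
throughput1 <- cast(df, time ~ partitioning_mode, value = "runtime", fun.aggregate = length)
# Same counts, transposed.
throughput2 <- cast(df, partitioning_mode ~ time, value = "runtime", fun.aggregate = length)
# Counts broken down further by workload.
throughput3 <- cast(df, partitioning_mode + workload ~ time, value = "runtime", fun.aggregate = length)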
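
For larger inputs, the data.table route mentioned above could look roughly like this. It is only a sketch: it assumes the data.table package is installed (hence the guard), uses the sample rows from the quoted message, and throughput_dt is a name picked here.

```r
# Sample data from the quoted message, reduced to the columns needed here.
df <- data.frame(
  time = rep(1, 15),
  partitioning_mode = rep(c("sharding", "replication"), times = c(11, 4)),
  runtime = c(607, 85, 52, 79, 77, 67, 98,
              2932, 2870, 2877, 2868,
              2891, 2907, 2922, 2937)
)

# Only run if data.table is available; it is not part of base R.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  # .N is data.table's per-group row count, the analogue of count(*).
  throughput_dt <- dt[, .N, by = c("time", "partitioning_mode")]
  print(throughput_dt)
}
```

The result has one row per (time, partitioning_mode) pair with the count in a column named N.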

----------------------------------------
> From: bravegag at gmail.com
> Date: Sun, 23 Oct 2011 19:29:40 +0200
> To: r-help at r-project.org
> Subject: [R] summarizing a data frame i.e. count -> group by
>
> Hello,
>
> This is one problem at a time :)
>
> I have a data frame df that looks like this:
>
>    time partitioning_mode workload runtime
> 1     1          sharding    query     607
> 2     1          sharding    query      85
> 3     1          sharding    query      52
> 4     1          sharding    query      79
> 5     1          sharding    query      77
> 6     1          sharding    query      67
> 7     1          sharding    query      98
> 8     1          sharding  refresh    2932
> 9     1          sharding  refresh    2870
> 10    1          sharding  refresh    2877
> 11    1          sharding  refresh    2868
> 12    1       replication    query    2891
> 13    1       replication    query    2907
> 14    1       replication    query    2922
> 15    1       replication    query    2937
>
> and if I could use SQL ... omg! I really wish I could! I would do exactly this:
>
> insert into throughput
> select time, partitioning_mode, count(*)
> from data.frame
> group by time, partitioning_mode
>
> My attempted R versions are wrong and produce very cryptic error messages:
>
> > throughput <- aggregate(x=df[,c("time", "partitioning_mode")], by=list(df$time,df$partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
> incorrect number of dimensions
>
> > throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
> incorrect number of dimensions
>
> > throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning), FUN=count)
> I can't comprehend what comes out from this one ... :(
>
> and I thought C++ template errors were the most cryptic ;P
>
> Many many thanks in advance,
> Best regards,
> Giovanni