[R] How to sum one column in a data frame keyed on other columns

George Nachman gnachman at llamas.org
Wed Dec 13 03:20:48 CET 2006


Thanks, everyone, for the help!

Bill:
This looks like a really great general solution, but unfortunately I
get an error when I call sumUp:

Error in "[.data.frame"(dat, , key_list, drop = FALSE) :
        invalid subscript type

I'm running 2.2.1 because that's the latest I can get on
ubuntu...could that be the problem?


On 12/12/06, Bill.Venables at csiro.au <Bill.Venables at csiro.au> wrote:
> Here is an elementary way of doing it:
>
> > dat
>           url time somethingirrelevant visits
> 1 www.foo.com 1:00                 xxx    100
> 2 www.foo.com 1:00                 yyy     50
> 3 www.foo.com 2:00                 xyz     25
> 4 www.bar.com 1:00                 xxx    200
> 5 www.bar.com 1:00                 zzz    200
> 6 www.foo.com 2:00                 xxx    500
> > dat <- transform(dat, key = paste(url, time))
> > total_visits <- with(dat, tapply(visits, key, sum))
> > m <- match(names(total_visits), dat$key)
> > tdat <- cbind(dat[m, c("url", "time")], total_visits)
> > tdat
>           url time total_visits
> 4 www.bar.com 1:00          400
> 1 www.foo.com 1:00          150
> 3 www.foo.com 2:00          525
> >
>
> This should not be too difficult to morph into a fairly general
> function.  Here's what I might do [warning: somewhat obscure code
> follows]
>
> sumUp <- function(dat, key_list, sum_list) {
>   key <- with(dat, do.call("paste", dat[, key_list, drop = FALSE]))
>   totals <- as.matrix(sapply(dat[, sum_list, drop = FALSE], tapply, key,
> sum))
>   dimnames(totals)[[2]] <- paste("total", sum_list, sep = "_")
>   m <- match(dimnames(totals)[[1]], key)
>   cbind(dat[m, key_list, drop = FALSE], totals)
> }
>
> check:
>
> > sumUp(dat, c("url", "time"), "visits")
>           url time total_visits
> 4 www.bar.com 1:00          400
> 1 www.foo.com 1:00          150
> 3 www.foo.com 2:00          525
>
> > sumUp(dat, "url", "visits")
>           url total_visits
> 4 www.bar.com          400
> 1 www.foo.com          675
>
> Question for the reader: why to you need 'drop = FALSE' (in three
> places)?
>
> Bill Venables.
>
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of George Nachman
> Sent: Wednesday, 13 December 2006 9:35 AM
> To: r-help at stat.math.ethz.ch
> Subject: [R] How to sum one column in a data frame keyed on other
> columns
>
> I have a data frame that looks like this:
>
> url         time somethingirrelevant visits
> www.foo.com 1:00 xxx                 100
> www.foo.com 1:00 yyy                 50
> www.foo.com 2:00 xyz                 25
> www.bar.com 1:00 xxx                 200
> www.bar.com 1:00 zzz                 200
> www.foo.com 2:00 xxx                 500
>
> I'd like to write some code that takes this as input and outputs
> something like this:
>
> url         time total_vists
> www.foo.com 1:00 150
> www.foo.com 2:00 525
> www.bar.com 1:00 400
>
> In other words, I need to calculate the sum of visits for each unique
> tuple of (url,time).
>
> I can do it with this code, but it's very slow, and doesn't seem like
> the right approach:
>
> keys = list()
> getkey = function(m,cols,index) { paste(m[index,cols],collapse=",")  }
> for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] = 0 }
> for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] =
> keys[[getkey(data,1:2,i)]] + data[i,4] }
>
> I'm sure there's a more functional-programming approach to this
> problem! Any ideas?
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>



More information about the R-help mailing list