[R] A question built on How to sum one column in a data frame keyed on other columns

David Winsemius dwinsemius at comcast.net
Sat Jul 4 15:40:31 CEST 2009


On Jul 4, 2009, at 6:21 AM, Vincent Vinh-Hung wrote:

> Dear List:
>
> I have a question related to a previous discussion
> How to sum one column in a data frame keyed on other columns
> https://stat.ethz.ch/pipermail/r-help/2006-December/122141.html
> (George Nachman, Bill Venables)
>
> The original query was to calculate the sum of visits for each unique
> tuple of (url, time) from the data frame dat:
>
>> dat
>          url time somethingirrelevant visits
> 1 www.foo.com 1:00                 xxx    100
> 2 www.foo.com 1:00                 yyy     50
> 3 www.foo.com 2:00                 xyz     25
> 4 www.bar.com 1:00                 xxx    200
> 5 www.bar.com 1:00                 zzz    200
> 6 www.foo.com 2:00                 xxx    500
>
> The response gave:
>
>> tdat
>          url time total_visits
> 4 www.bar.com 1:00          400
> 1 www.foo.com 1:00          150
> 3 www.foo.com 2:00          525
>
> My question is how can I build a similar data frame but having also  
> rows for
> combinations of (url, time) that were not in dat?
> In this example, the (url, time) without record of visit would be
> (www.bar.com, 2:00):
>
>> ndat
>          url time total_visits
> www.bar.com 1:00          400
> www.bar.com 2:00            0
> www.foo.com 1:00          150
> www.foo.com 2:00          525
>
> I have tried to build a data frame with
>> adat <- data.frame (url = rep(unique(dat$url),  
>> each=length(unique(dat$time))),
> 	time=unique(dat$time), alt_visits=0)
>> ndat <- merge (adat, tdat, by=c("url", "time"), all = TRUE)
> then replace the NA and remove the alt_visits column.
> But this appears clumsy and quite slow with 10000 rows and 4 columns,
> I would be most grateful for other suggestions,

?expand.grid

Perhaps:

 > fulldat <- expand.grid( unique(dat$url), unique(dat$time) )
 > names(fulldat) <- c("url", "time")
 > merge(fulldat, tdat)
           url time total_visits
1 www.bar.com 1:00          400
2 www.foo.com 1:00          150
3 www.foo.com 2:00          525
 > merge(fulldat, tdat, all=TRUE)  # notice that default is to drop  
missing levels
           url time total_visits
1 www.bar.com 1:00          400
2 www.bar.com 2:00           NA
3 www.foo.com 1:00          150
4 www.foo.com 2:00          525

If they are not ordered, then you might want to sort the elements  
before they get sent to expand.grid for purposes of readability.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT




More information about the R-help mailing list