[R] how to create data.frames from vectors with duplicates

Thu Sep 8 03:55:18 CEST 2011

Hi:

Here are a few informal timings on my machine for the following
example (1M observations, 26 groups). The data.table package is worth
investigating, particularly for problems where its advantages scale
with size.

library(data.table)
dt <- data.table(x = sample(1:50, 1000000, replace = TRUE),
                 y = sample(letters[1:26], 1000000, replace = TRUE),
                 key = 'y')
system.time(dt[, list(count = sum(x)), by = 'y'])
   user  system elapsed
   0.02    0.00    0.02

# Data tables are also data frames, so we can use them as such:

system.time(with(dt, tapply(x, y, sum)))
   user  system elapsed
   0.39    0.00    0.39
system.time(with(dt, rowsum(x, y)))
   user  system elapsed
   0.04    0.00    0.03
system.time(aggregate(x ~ y, data = dt, FUN = sum))
   user  system elapsed
   1.87    0.00    1.87

So rowsum() is good, but data.table is a little better for this task.
Increasing the size of the problem works to the advantage of both
data.table and rowsum(), whereas tapply() takes a fair bit longer,
relatively speaking (approx. 10x rowsum() in the first example, 20x in
the second example below). The ratio of rowsum() to data.table time
stays about the same (approx. 2x).
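
The timings here are single runs, so the ratios are rough. A minimal
sketch for repeating the tapply() vs. rowsum() comparison over several
runs, using base R only. The helper name time_median, the seed, and the
smaller problem size are assumptions chosen for illustration, not part
of the timings reported in this message:

```r
set.seed(1)                       # fixed seed so the data are reproducible
n <- 1e5                          # assumed size, kept small so this runs quickly
x <- sample(1:50, n, replace = TRUE)
y <- sample(letters, n, replace = TRUE)

# Hypothetical helper: median elapsed time over `reps` runs of a function
time_median <- function(f, reps = 5) {
  median(replicate(reps, system.time(f())[["elapsed"]]))
}

t_tapply <- time_median(function() tapply(x, y, sum))
t_rowsum <- time_median(function() rowsum(x, y))

# Ratio of tapply() to rowsum() time; at this size rowsum() may fall
# below the timer resolution, in which case the ratio is Inf or NaN
t_tapply / t_rowsum
```

Increase n to get closer to the problem sizes timed in this message.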

# 10M observations, 1000 groups
dt <- data.table(x = sample(1:100, 10000000, replace = TRUE),
                 y = sample(1:1000, 10000000, replace = TRUE),
                 key = 'y')
system.time(dt[, list(count = sum(x)), by = 'y'])
   user  system elapsed
   0.16    0.03    0.18
system.time(with(dt, rowsum(x, y)))
   user  system elapsed
   0.36    0.04    0.40
system.time(with(dt, tapply(x, y, sum)))
   user  system elapsed
   8.77    0.33    9.11
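
For reference, a small sketch on the two vectors from the original
question (quoted below), checking that the base R calls timed above
return the same group sums and showing one way to get the one-column
data frame that was asked for. The object names s_tapply, s_rowsum and
s_agg are only illustrative:

```r
# Vectors from the original question
x <- c(1, 2, 3, 4, 5)
y <- c("a", "b", "c", "a", "c")

# tapply() returns a named array of group sums
s_tapply <- tapply(x, y, sum)

# rowsum() returns a one-column matrix with the group labels as row names
s_rowsum <- rowsum(x, y)

# aggregate() returns a data frame with columns y and x
s_agg <- aggregate(x ~ y, data = data.frame(x = x, y = y), FUN = sum)

# All three agree on the sums
stopifnot(all.equal(as.vector(s_tapply), as.vector(s_rowsum)),
          all.equal(as.vector(s_tapply), s_agg$x))

# One-column data frame in the shape requested
xy <- data.frame(count = s_rowsum[, 1], row.names = rownames(s_rowsum))
print(xy)
#   count
# a     5
# b     2
# c     8
```

The data.table call used in the timings returns the same sums as a
two-column table (y and count) rather than using row names.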

HTH,
Dennis

On Wed, Sep 7, 2011 at 6:18 PM, zhenjiang xu <zhenjiang.xu at gmail.com> wrote:
> Thanks for all your replies. I am using rowsum() and it looks efficient. I
> hope to run some benchmarks in the near future and let people know.
> Or are any benchmark results already available?
>
> On Wed, Aug 31, 2011 at 12:58 PM, Bert Gunter <gunter.berton at gene.com> wrote:
>
>> Inline below:
>>
>> On Wed, Aug 31, 2011 at 9:50 AM, Jorge I Velez <jorgeivanvelez at gmail.com>
>> wrote:
>> > Hi Zhenjiang,
>> >
>> > Try
>> >
>> > table(unlist(mapply(function(x, y) rep(x, y), y, x)))
>>
>> Yikes! How about simply tapply(x,y,sum) ??
>> ?tapply
>>
>> -- Bert
>> >
>> > HTH,
>> > Jorge
>> >
>> >
>> > On Wed, Aug 31, 2011 at 12:45 PM, zhenjiang xu <> wrote:
>> >
>> >> Hi R users,
>> >>
>> >> suppose I have two vectors,
>> >>  > x=c(1,2,3,4,5)
>> >>  > y=c('a','b','c','a','c')
>> >> How can I get a data.frame like this?
>> >> > xy
>> >>      count
>> >> a     5
>> >> b     2
>> >> c     8
>> >>
>> >> I know a few ways to accomplish the task. However, I have a huge number
>> >> of calculations of this kind, so I'd like an efficient solution. Thanks
>> >>
>> >> --
>> >> Best,
>> >> Zhenjiang
>> >>
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >>
>
> --
> Best,
> Zhenjiang