[R] ddply for comparing simulation results

Fri Aug 30 23:11:07 CEST 2013

This might do it:

> lhs=c('a','a','a','b')
> rhs=c('a','b','b','b')
>
>
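> # function to determine per-value count differences (l minus r)
> f_diff <- function(l, r){
+     # tabulate over the union of values so that a value missing
+     # from either side is counted as 0 instead of being dropped
+     lev <- union(l, r)
+     t_l <- table(factor(l, levels = lev))
+     t_r <- table(factor(r, levels = lev))
+     setNames(as.vector(t_l - t_r), lev)
+ }
>
> f_diff(lhs, rhs)
 a  b
 2 -2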
>
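
The same counting idea can be applied per userId to a leads data frame with isSimulated and userId columns. What follows is only a minimal base-R sketch on a tiny made-up data frame; the values and the names lead_change and change_dist are illustrative assumptions, not taken from the real data:

```r
# Small made-up data set (the values are illustrative only)
df_leads <- data.frame(
    id          = 1:8,
    isSimulated = c(0, 0, 0, 0, 1, 1, 1, 1),
    userId      = c(1, 1, 2, 3, 1, 2, 2, 4)
)

# Cross-tabulate userId by isSimulated; fixing the factor levels
# guarantees both columns exist, and absent combinations count as 0
tab <- table(
    userId      = df_leads$userId,
    isSimulated = factor(df_leads$isSimulated, levels = c(0, 1))
)

# Simulated count minus production count, per user
lead_change <- tab[, "1"] - tab[, "0"]
print(lead_change)

# Distribution of the change across users (the basis for a histogram)
change_dist <- table(lead_change)
print(change_dist)
```

A userId that appears in only one of the two groups simply gets a count of 0 for the other, so no special handling of missing users is needed.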

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Fri, Aug 30, 2013 at 1:28 PM, john doe <anon.r.user at gmail.com> wrote:
> Hi Jim,
>
> Thanks for the quick reply!  The data.table package sounds like it will
> help me with my performance problem.  However, I think that setdiff does
> not do quite what I need.  Consider this example:
>
>> lhs=c('a','a','a','b')
>> rhs=c('a','b','b','b')
>> setdiff(lhs,rhs)
> character(0)
>
> I need to do an operation between lhs and rhs which gives this result:
>   a: 2
>   b: -2
>
> It looks like base set operations treat their arguments as sets and
> discard duplicates, which does not allow me to measure the
> magnitude of the difference between the sets.
>
> Ari
>
>
>
>
> On Fri, Aug 30, 2013 at 5:10 AM, jim holtman <jholtman at gmail.com> wrote:
>>
>> Try the 'data.table' package.  It gives the answer in less than a second.
>>
>> > # 1 million leads, half of which were simulated, half of which were not
>> > id=1:1000000
>> > isSimulated = c(rep(0,500000), rep(1, 500000))
>> > userId=sample(1:100000, 1000000, replace=TRUE)
>> > df_leads=data.frame(id, isSimulated, userId)
>> > require(data.table)
>> Loading required package: data.table
>> data.table 1.8.8  For help type: help("data.table")
>> > system.time({
>> +     df_leads <- data.table(df_leads)
>> +     df_leads_sum <- df_leads[
>> +         , list(count = .N)
>> +         , keyby = c('isSimulated', 'userId')
>> +         ]
>> + })
>>    user  system elapsed
>>    0.75    0.01    0.76
>> >
>> > head(df_leads_sum)
>>    isSimulated userId count
>> 1:           0      1     5
>> 2:           0      2     9
>> 3:           0      3     5
>> 4:           0      4     4
>> 5:           0      5     3
>> 6:           0      6     7
>>
>>
>> You can use 'setdiff' to find userIds that appear in one group but not
>> the other (here, users with production leads but no simulated leads):
>>
>> > # userIds present in production (isSimulated == 0) but not in simulation
>> > not_in <- setdiff(df_leads_sum$userId[df_leads_sum$isSimulated == 0]
>> +       , df_leads_sum$userId[df_leads_sum$isSimulated == 1]
>> +       )
>> > str(not_in)
>>  int [1:697] 59 100 204 584 656 828 840 999 1012 1046 ...
>> >
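
Since setdiff(x, y) only reports values of x that are absent from y, the reverse direction needs a second call with the arguments swapped. A minimal sketch with made-up userId vectors (prod_ids and sim_ids are hypothetical names, not objects from the session above):

```r
# Made-up userIds for each group (illustrative values only)
prod_ids <- c(1, 2, 3, 5)   # users with production leads
sim_ids  <- c(2, 3, 4)      # users with simulated leads

# setdiff is one-directional, so run it both ways
only_prod <- setdiff(prod_ids, sim_ids)   # in production, not simulated
only_sim  <- setdiff(sim_ids, prod_ids)   # simulated, not in production

print(only_prod)
print(only_sim)
```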
>>
>>
>> On Thu, Aug 29, 2013 at 11:33 PM, john doe <anon.r.user at gmail.com> wrote:
>> > I am trying to use R and plyr to compare the effectiveness of various
>> > algorithms for online advertising.  At the core, I am simply counting
>> > when a user receives a lead: this is measured with the userId column.
>> > Leads that were sent in production have a 0 in the isSimulated column,
>> > and leads that were sent in our simulation have isSimulated=1.  I have
>> > two questions: one about performance and one about how to use plyr to
>> > get the data in a form that I want.
>> >
>> > Here is an example of my code:
>> >
>> > # 1 million leads, half of which were simulated, half of which were not
>> > id=1:1000000
>> > isSimulated = c(rep(0,500000), rep(1, 500000))
>> > userId=sample(1:100000, 1000000, replace=TRUE)
>> > df_leads=data.frame(id, isSimulated, userId)
>> >
>> > # split by simulated and userid, and then count rows per group
>> > library(plyr)
>> > system.time(df_leads_sum <- ddply(df_leads, .(isSimulated, userId), nrow))
>> >    user  system elapsed
>> >  38.167   0.212  38.386
>> >
>> > The above call to ddply is great because it allows me to create
>> > histograms of how many people receive just a few leads, or a lot of
>> > leads, both in production and in the simulator.
>> >
>> > Question 1: The above ddply call takes a while to execute.  With
>> > production data it takes several minutes in R, but only a few seconds
>> > in MySQL.  Is there a way to improve the performance of the above call?
>> >
>> > Question 2: What I would really like to do is create a histogram which
>> > measures the distribution of change in leads between non-simulated and
>> > simulated data.  A complicating factor is that some users might only
>> > appear in simulated or non-simulated data, so I need to correctly
>> > handle the absence of a userId.  (In production, users are actually
>> > guaranteed to appear in the production data, but the crux of the
>> > problem is the same: userIds might be missing in one of the splits.)
>> > Can someone help me with this?  I've read the documentation a few
>> > times, and think that the summarize function might be able to help,
>> > but I'm not quite sure how to do this.
>> >
>> > Thanks.
>> >
>
>