[R] operations between two aggregated data frames?

Gabor Grothendieck ggrothendieck at gmail.com
Sat May 15 00:38:05 CEST 2010


Generating df1 and df2 as in your post try this (and see
http://sqldf.googlecode.com for more info):

> library(sqldf)
> out <- sqldf("select category,
+ df1.date date1,
+ df2.date date2,
+ df1.date - df2.date datediff
+ from df1 join df2 using(category)
+ order by category, date1, date2")
>
> out[[2]] <- as.Date(out[[2]], origin = "1970-01-01")
> out[[3]] <- as.Date(out[[3]], origin = "1970-01-01")
> out
   category      date1      date2 datediff
1         1 2003-02-08 2003-05-17      -98
2         1 2003-02-08 2003-11-29     -294
3         1 2008-08-16 2003-05-17     1918
4         1 2008-08-16 2003-11-29     1722
5         1 2008-11-29 2003-05-17     2023
6         1 2008-11-29 2003-11-29     1827
7         2 2001-12-01 2005-02-26    -1183
8         2 2001-12-01 2008-10-25    -2520
9         2 2002-01-12 2005-02-26    -1141
10        2 2002-01-12 2008-10-25    -2478
11        3 2003-09-13 2005-10-01     -749
12        3 2003-09-13 2007-08-18    -1435
13        3 2009-08-01 2005-10-01     1400
14        3 2009-08-01 2007-08-18      714
15        4 2000-11-04 2000-11-18      -14
16        4 2000-11-04 2003-05-10     -917
17        4 2000-11-04 2003-07-26     -994
18        4 2000-11-04 2008-11-22    -2940
19        4 2004-03-13 2000-11-18     1211
20        4 2004-03-13 2003-05-10      308
21        4 2004-03-13 2003-07-26      231
22        4 2004-03-13 2008-11-22    -1715
23        4 2007-06-02 2000-11-18     2387
24        4 2007-06-02 2003-05-10     1484
25        4 2007-06-02 2003-07-26     1407
26        4 2007-06-02 2008-11-22     -539

On Fri, May 14, 2010 at 6:13 PM, Jonathan <jonsleepy at gmail.com> wrote:
> Hi All,
>   I've come up with a solution for this problem that relies on a for loop,
> and I was wondering if anybody had any insight into a more elegant method:
>
> I have two data frames, each has a column for categorical data and a column
> for date.  What I'd like to do, ideally, is calculate the number of days
> between all pairs of dates in data frame 1 and data frame 2 (*but only for
> members of the same category*).  The number of members of each category
> varies between the two data frames.
>
> For example:
>
>
>> d <- seq(as.Date("2000-02-12"), as.Date("2009-08-18"), by="weeks")
>
>> df1 <- data.frame('A'=sample(1:200,10), 'date'=d[sample(1:length(d),10)],'category'=sample(1:4,10,replace=TRUE))
>
>> df2 <- data.frame('A'=sample(1:200,10), 'date'=d[sample(1:length(d),10)],'category'=sample(1:4,10,replace=TRUE))
>
>
>> df1
>     A       date category
> 1   93 2004-02-28        3
> 2  105 2001-03-17        3
> 3  189 2009-07-04        2
> 4  130 2003-07-05        2
> 5  160 2005-09-24        2
> 6   32 2004-11-06        2
> 7  117 2007-03-17        1
> 8  161 2003-07-19        4
> 9  153 2001-09-15        3
> 10 173 2005-08-27        1
>
>
>> df2
>     A       date category
> 1  102 2006-08-19        3
> 2   68 2004-11-27        2
> 3  137 2003-01-11        1
> 4   39 2002-12-28        2
> 5  127 2004-03-06        4
> 6  125 2002-02-23        2
> 7  150 2002-05-18        4
> 8   19 2003-02-22        1
> 9   80 2000-08-05        1
> 10  94 2003-12-27        1
>
>
> Within a loop, I'd do the following (i is my counter; for the example,
> I set it to 1):
>
>
>> i<-1
>
> # Create the data frames:
>
>> yeari_1 <- df1[which(df1['category']==i),]; yeari_2 <- df2[which(df2['category']==i),]
>
> # Select only the data from category i
>
>> yeari_1
>     A       date category
> 7  117 2007-03-17        1
> 10 173 2005-08-27        1
>
>> yeari_2
>     A       date category
> 3  137 2003-01-11        1
> 8   19 2003-02-22        1
> 9   80 2000-08-05        1
> 10  94 2003-12-27        1
>
> # Convert dates to integers
>
> year1_i[[2]] <- as.integer(as.Date(yeari_1[[2]])); yeari_2[[2]] <-
> as.integer(as.Date(yeari_2[[2]]));
>
>> yeari_1
>     A  date category
> 7  117 13589        1
> 10 173 13022        1
>> yeari_2
>     A  date category
> 3  137 12063        1
> 8   19 12105        1
> 9   80 11174        1
> 10  94 12413        1
>
> # Get differences of all pairs:
>
>> result <- outer(yeari_1[[2]],yeari_2[[2]],'-')
>> result
>     [,1] [,2] [,3] [,4]
> [1,] 1526 1484 2415 1176
> [2,]  959  917 1848  609
>
> # Now, merge the results with the results from all the earlier
> iterations for previous values of i, increment i to the next value,
> and repeat.
>
>
> ----
>
> Ideally, I could accomplish this in some sort of vectorized manner,
> although the Force is not yet strong with me.  Any ideas would be
> appreciated!
>
>
> Regards,
>
> Jonathan
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



More information about the R-help mailing list