[R] Help with aggregate and cor

James Marca jmarca at translab.its.uci.edu
Wed Mar 10 19:39:55 CET 2010


Hi Ista,

Many thanks, the plyr package was just what I needed.

Because I did such a bad job with my question (no data, etc etc), here
is my current solution:

First, I grabbed my data from PostgreSQL as follows:

library('RPostgreSQL')
m <- dbDriver("PostgreSQL")
con <- dbConnect(m, user = "user", password = "yourpassword",
                 host = "super.secret.host.edu", dbname = "yourdb")

rs <- dbSendQuery(con, "select vds_id, to_char(ts,'Dy') as dow, date_trunc('hour'::text, ts) as tshour, ts, n1,o1, n2,o2, n3,o3, n4,o4, n5,o5 from data_raw where vds_id=1201087 and (ts > 'Mar 1, 2007' and ts < 'Apr 1, 2007 00:00:00')")

df.i1 <- fetch(rs, n = -1)
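
As an aside, since I don't otherwise use the result set object, DBI's
dbGetQuery() should collapse the send-and-fetch into a single call
(untested on a pull this size, but it ought to be equivalent):

df.i1 <- dbGetQuery(con, "select ... same query as above ...")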

This time period yields 88,000-odd records.  The data look like this:

 summary(df.i1)
     vds_id            dow                tshour                   
 Min.   :1201087   Length:88070       Min.   :2007-03-01 00:00:00  
 1st Qu.:1201087   Class :character   1st Qu.:2007-03-08 19:00:00  
 Median :1201087   Mode  :character   Median :2007-03-16 15:00:00  
 Mean   :1201087                      Mean   :2007-03-16 13:51:31  
 3rd Qu.:1201087                      3rd Qu.:2007-03-24 08:00:00  
 Max.   :1201087                      Max.   :2007-03-31 23:00:00  
       ts                            n1               o1         
 Min.   :2007-03-01 00:00:30   Min.   : 0.000   Min.   :0.00000  
 1st Qu.:2007-03-08 19:09:37   1st Qu.: 3.000   1st Qu.:0.01620  
 Median :2007-03-16 15:08:45   Median : 8.000   Median :0.04860  
 Mean   :2007-03-16 14:21:18   Mean   : 8.147   Mean   :0.05024  
 3rd Qu.:2007-03-24 08:25:52   3rd Qu.:12.000   3rd Qu.:0.07330  
 Max.   :2007-03-31 23:59:30   Max.   :35.000   Max.   :0.79440  
       n2               o2                n3               o3         
 Min.   : 0.000   Min.   :0.00000   Min.   : 0.000   Min.   :0.00000  
 1st Qu.: 4.000   1st Qu.:0.02160   1st Qu.: 3.000   1st Qu.:0.01940  
 Median : 8.000   Median :0.04580   Median : 6.000   Median :0.03900  
 Mean   : 7.268   Mean   :0.04584   Mean   : 5.682   Mean   :0.04178  
 3rd Qu.:10.000   3rd Qu.:0.06370   3rd Qu.: 8.000   3rd Qu.:0.05910  
 Max.   :27.000   Max.   :0.77750   Max.   :27.000   Max.   :0.63060  
       n4               o4                n5               o5         
 Min.   : 0.000   Min.   :0.00000   Min.   : 0.000   Min.   :0.00000  
 1st Qu.: 2.000   1st Qu.:0.01450   1st Qu.: 1.000   1st Qu.:0.00640  
 Median : 5.000   Median :0.03400   Median : 3.000   Median :0.02040  
 Mean   : 5.418   Mean   :0.03811   Mean   : 3.706   Mean   :0.03085  
 3rd Qu.: 8.000   3rd Qu.:0.05510   3rd Qu.: 6.000   3rd Qu.:0.04530  
 Max.   :28.000   Max.   :0.59930   Max.   :27.000   Max.   :0.66210  
 
df.i1[1,]
   vds_id dow     tshour                  ts n1     o1 n2     o2 n3     o3 n4     o4 n5     o5
1 1201087 Thu 2007-03-01 2007-03-01 00:00:30  6 0.0373  5 0.0291  2 0.0217  2 0.0109  1 0.0086


The 'n_i' values are vehicle counts in lane i on a freeway over a
30-second interval, and the 'o_i' values are the occupancies for that
lane over the same 30 seconds, ranging from 0 (nothing over the sensor)
to 1 (somebody parked on top of the sensor).  I want to isolate just
those time periods when n and o are highly correlated, as I'm exploring
whether high correlation indicates free-flow traffic conditions.
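
To make that intuition concrete, here is a toy simulation (completely
made-up numbers, not my detector data): at a roughly constant free-flow
speed, occupancy is just the count times a per-vehicle time fraction,
so the two should track each other closely.

set.seed(42)
n <- rpois(120, lambda = 8)   # simulated 30-second vehicle counts
speed <- 30                   # assumed free-flow speed, m/s
veh.len <- 6                  # assumed effective vehicle length, m
# each vehicle occupies the sensor for veh.len/speed seconds, so
# occupancy is the occupied time as a fraction of the 30 seconds
o <- n * (veh.len / speed) / 30 + rnorm(120, sd = 0.002)
cor(n, o)                     # comes out close to 1 under these assumptions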

My final function, after hacking around a bit last night, looks like this:

cor.dat <- function(df) {
  # correlation test of count vs. occupancy for each of the 5 lanes
  cor.l1 <- cor.test(df$n1, df$o1)
  cor.l2 <- cor.test(df$n2, df$o2)
  cor.l3 <- cor.test(df$n3, df$o3)
  cor.l4 <- cor.test(df$n4, df$o4)
  cor.l5 <- cor.test(df$n5, df$o5)
  # c() flattens the five htest objects into one long named list
  c(l1=cor.l1, l2=cor.l2, l3=cor.l3, l4=cor.l4, l5=cor.l5)
}

(I'm not really sure what the best way to handle that return value is, but
at least this works for what I want...see below).
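
If all I really want are the estimates and p-values, a less repetitive
variant could build the column names from the lane index.  This is just
a sketch (the name cor.dat2 is mine, and it assumes only the n1..n5 /
o1..o5 column naming above):

cor.dat2 <- function(df) {
  res <- unlist(lapply(1:5, function(i) {
    ct <- cor.test(df[[paste("n", i, sep = "")]],
                   df[[paste("o", i, sep = "")]])
    c(ct$estimate, p.value = ct$p.value)
  }))
  # label the flattened result l1.cor, l1.p.value, ..., l5.p.value
  names(res) <- paste("l", rep(1:5, each = 2),
                      c(".cor", ".p.value"), sep = "")
  res
}

Returning a plain named vector would also make the hourly results easy
to stack into a data frame later.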


Then I run cor.dat hourly with plyr:

library(plyr)
output.hourly <- dlply(df.i1, "tshour", cor.dat)

(Because I used c(...) the output is an ugly flattened list, but I can
pick out the pieces I need):

get.cor <- function (l, w) {
  # pull one named element, e.g. "l1.estimate", out of the flattened list
  l[w]
}

So to get the "estimate" of the parameter for lane 1:

g <- lapply(output.hourly,get.cor,"l1.estimate")
g[1]
$`2007-03-01 00:00:00`
$`2007-03-01 00:00:00`$l1.estimate
      cor 
0.9845006 

plot(unlist(g))

etc.
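
A slightly less ugly way to collect everything at once is to let plyr's
ldply stack the per-hour estimates into a data frame.  Again just a
sketch, relying on the l1.estimate ... l5.estimate names that c()
produced above:

est.hourly <- ldply(output.hourly, function(l) {
  # grab each lane's correlation estimate for this hour
  unlist(l[paste("l", 1:5, ".estimate", sep = "")])
})
# the .id column holds the hour; unlist() tacked ".cor" onto each name
plot(est.hourly$l1.estimate.cor, type = "l")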

Super ugly, but I'm slowly remembe-R-ing.

Again thanks a lot for the tip.  

Regards, 
James



On Tue, Mar 09, 2010 at 10:24:05PM -0500, Ista Zahn wrote:
> Hi James,
> It would really help if you gave us a sample of the data you are
> working with. The following is not tested, because I don't have your
> data and am too lazy to construct a similar example dataset for you,
> but it might get you started.
> 
> You can try using a for loop along the lines of
> 
> output <- data.frame(obsfivemin = obsfivemin,
>                      fivemin.cor = vector(length = length(obsfivemin)))
> for (f in fivemin){
>        output$fivemin.cor[obsfivemin==f] <- cor(df[obsfivemin==f, c("v", "o")])
>      }
> 
> Or you can try with the plyr package something like
> 
> cor.dat <- function(df) {
>   cor(df[,c("v", "o")])
> }
> 
> library(plyr)
> dlply(df, "obsfivemin", cor.dat)
> 
> Good luck,
> Ista
> 
> 
> On Tue, Mar 9, 2010 at 9:36 PM, James Marca <jmarca at translab.its.uci.edu> wrote:
> > Hello,
> >
> > I do not understand the correct way to approach the following problem
> > in R.
> >
> > I have observations of pairs of variables, v1, o1, v2, o2, etc,
> > observed every 30 seconds.  What I would like to do is compute the
> > correlation matrix, but not for all my data, just for, say 5 minutes
> > or 1 hour chunks.
> >
> > In sql, what I would say is
> >
> >    select id, date_trunc('hour'::text, ts) as tshour, corr(n1,o1) as corr1
> >    from raw30s
> >    where id = 1201087  and
> >          (ts between 'Mar 1, 2007' and 'Apr 1, 2007')
> >    group by id,tshour order by id,tshour;
> >
> >
> > I've pulled data from PostgreSQL into R, and have a dataframe
> > containing a timestamp column, v, and o (both numeric).
> >
> > I created an grouping index for every 5 minutes along these lines:
> >
> >    obsfivemin <- trunc(obsts,units="hours")
> >                   +( floor( (obsts$min / 5 ) ) * 5 * 60 )
> >
> > (where obsts is the sql timestamp converted into a DateTime object)
> >
> > Then I tried aggregate(df,by=obsfivemin,cor), but that seemed to pass
> > just a single column at a time to cor, not the entire data frame.  It
> > worked for mean and sum, but not cor.
> >
> > In desperation, I tried looping over the different 5 minute levels and
> > computing cor, but I'm so R-clueless I couldn't even figure out how to
> > assign to a variable inside of that loop!
> >
> > code such as
> >
> >    for (f in fivemin){
> >        output[f] <- cor(df[grouper==f,]); }
> >
> > failed, as I couldn't figure out how to initialize output so that
> > output[f] would accept the output of cor.
> >
> > Any help or steering towards the proper R-way would be appreciated.
> >
> > Regards,
> >
> > James Marca
> >
> >
> >
> 
> 
> 
> -- 
> Ista Zahn
> Graduate student
> University of Rochester
> Department of Clinical and Social Psychology
> http://yourpsyche.org
> 
> 
> 

-- 
James E. Marca, PhD
Researcher
Institute of Transportation Studies
AIRB Suite 4000
University of California
Irvine, CA 92697-3600
jmarca at translab.its.uci.edu
(949) 824-6287

