[R] data manipulation in R

Patrick Ball lookout_20005 at yahoo.com
Tue Apr 17 00:27:03 CEST 2001

Wow! great answers from Thomas and Reid.  Very helpful

To continue the discussion, Thomas rightly pointed out
(1) that the matrix I described is complete, not
lower-triangular as I'd erroneously written; and (2)
you have to iterate over either events or
observations.  My solution went over obersvations. 
His description below is exactly what we see, events
[A] by observers [B].

> Suppose A is the matrix of zeros & ones, in your
> case
> > A
>      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
> [1,]    1    1    1    0    0    0    0
> [2,]    0    1    0    1    1    0    0
> [3,]    0    0    0    1    0    0    0
> [4,]    0    0    0    1    0    1    1

and as soon as I'm back to my home machine I'll test
the solution he suggests:
> We can make an incidence matrix B for the graph
>   nevent<-NCOL(A)
>   nobs<-NROW(A)
>   B<-matrix(rowsum(apply(A,2,function(x)
>      outer(x,x)),rep(1,nobs*nobs)),ncol=nobs)
> And now find the connected components by powering up
> B
> D<-(B>0)
> for(i in 1:ceiling(log(nobs,2))){
> 	Dnew<-(D%*%D)>0
> 	if (all(Dnew==D)) break
> 	D<-Dnew
> }
> I think we now have D[i,j]==TRUE if i,j are in the
> same circuit.

Reid suggested that 

> Assumming I've correctly understood the problem, the
> algorithm you've
> sketched seems not to take proper account of
> observers (nodes) separated by
> more than one other observer. After the first two
> steps, you have for each i
> the set N(i) of its neighbors (nodes connected to i
> by an edge) and from
> this data it is essentially the same problem as
> before to decide which of
> these sets overlap etc. In other words you need to
> iterate these steps as
> many times as the length of the longest distance in
> the graph... This could
> be made to work but isn't efficient.

but actually this is part of the benefit of SQL.  The
data look like this:

dataset pairs
ev.  obs.
A    B
a    1
a    1
a    1
b    2
b    4
b    5
c    4
d    4
d    6
d    7

# in SQL pseudocode
 # get list of all observations
sele dist pairs.A order by 1 into out1
 # iterate over observations
for each i in (out1.A)
    # get all the events observed by i
   sele dist pairs.B where pairs.A=i into tmp1
    # get all the observations which observe 
    # one or more events in tmp1; this is
    # all the observations which link to i
   sele dist pairs.A where A in (sele * from tmp1) _
into tmp2
    #count [i_m] 
   sele cnt(*) from tmp1 into cnt1
    # assign events [i_m] to i's circuit if i_m > i
    # do some basic counter management before this
    # step: i.circuit has either been set in an 
    # earlier iteration or is blank and initialized;
    # if it was set earlier, we are extending it 
    # beyond the first hops in the graph
   update out1 where out1.A>tmp2.A set _

next i

The SQL calls are fast b/c there are usually
relatively few (<20) event-observation pairs for any
event or observation.  So the solution scales
relatively linearly with the number of events or
observations X the number of links.  In my data, ~10^3
- 10^4 events & observations, 1/3 as many pairs, and
~1.25 links/observation.  So it's not computationally

I will be interested to test the SQL against Thomas' R
solution (and will post results if there is interest),
and I'd love to look at C or C++ code as well.  Thanks
again to Reid and Thomas - PB.

Do You Yahoo!?
Get email at your own domain with Yahoo! Mail. 
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch

More information about the R-help mailing list