[R] Fast lookup in ragged array

Peter McMahan peter.mcmahan at gmail.com
Fri Mar 16 17:45:34 CET 2007


Well, I hadn't ever seen RBGL before, so that's great. I've been  
using igraph and sna mainly, but there are a few points lacking  
between these two. RBGL solves a lot of problems for me!

But I'm not sure it will solve this specific problem. Are you  
suggesting I use RBGL to do a depth-first search of all the  
subgraphs? For this particular depth-first search I'm not searching  
every subgraph, but just those that are constructed from a minimal  
cutset of the parent subgraph. At each level of the search, I have to  
compute graph cohesion (vertex connectivity), which can take  
considerable time. A lot of computation time is saved by only  
searching subgraphs obtained through cutsets. So a complete search of  
all the subgraphs won't work, but the redundancy I come across is I  
think unavoidable.

The particular algorithm I'm trying to implement is Moody and White's  
cohesive blocking, in which the end result is a nested set of all  
subgraphs with a higher cohesion (connectivity) than their parents.  
(see http://www.santafe.edu/research/publications/workingpapers/ 
00-08-049.pdf )

On Mar 16, 2007, at 11:00 AM, Robert Gentleman wrote:


> why not just use the graph package and RBGL at www.bioconductor.org
>
>
> Peter McMahan wrote:
>
>> Hello,
>> I'm running an algorithm for graph structural cohesion that  
>> requires  a depth-first search of subgraphs of a rather large  
>> network. The  algorithm will necessarily be redundant in the  
>> subgraphs it recurses  to, so to speed up the process I  
>> implemented a check at each subgraph  to see if it's been searched  
>> already.
>> This algorithm is very slow, and takes days to complete on a  
>> graph  with about 200 nodes. It seems that a significant portion  
>> of the  computation time is spent looking up the current subgraph  
>> in the list  of searched subgraphs to see if it is redundant, and  
>> I'm wondering if  there's a faster way to do this.
>> Currently, the list of searched subgraphs is a list   
>> (`theseSearchedBranches`) of unnamed numeric vectors. I'm  
>> checking  against the list using the following command:
>>      if(list(v) %in% theseSearchedBranches){
>>          cat("    Branch already searched: passing.\n\n")
>>          return(NULL)
>>      }
>> v is a sorted numeric, with length anywhere between 3 and 200.   
>> Because theseSearchedBranches gets quite long as the algorithm   
>> progresses, the %in% command is taking maybe one or two seconds  
>> per  lookup.
>> So to reiterate, I have a list() that looks something like this:
>> [[1]]
>> [1] 0 5 6 11 12 13 14 16 19 21 23 24 26 30 31 39 41
>> [18]56 72 75 76 85 95 105 110 134 158 159 165 186
>> [[2]]
>> [1] 0 5 6 11 12 13 14 16 19 21 23 24 26 30 31 39 41
>> [18]56 72 75 76 85 95 105 110 134 147 159 165 186
>> [[3]]
>> [1] 0 5 6 11 12 13 14 16 19 21 23 24 26 30 31 39 41
>> [18]50 56 72 75 85 95 105 110 134 147 158 159 165 186
>> ...
>> and so on for tens of thousands of entries, and I am trying to  
>> find  some sort of fast equivalent for %in% to search it. I'm also  
>> not  adding the vectors to the list in any particular order, as I  
>> don't  think %in% would know how to take advantage of that anyway.
>> Is there a data structure other than list() that I can use that  
>> would  be faster? Would it be better to just create a hashed env  
>> and add  empty variables named "0.5.6.11.12..."?
>> I know there are fast lookup algorithms out there that could take   
>> advantage of the fact that the items being searched are  
>> indiviually  ordered numeric vectors, but I can't find any info  
>> about R  implementations on the mailing lists or help. Is there  
>> any data type  that implements a b-tree type of lookup scheme?
>> Any help would be greatly appreciated.
>> Thanks,
>> Peter
>> ______________________________________________
>> R-help at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting- 
>> guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> -- 
> Robert Gentleman, PhD
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> PO Box 19024
> Seattle, Washington 98109-1024
> 206-667-7700
> rgentlem at fhcrc.org
>



More information about the R-help mailing list