[R] Needing a better solution to a lookup problem.

Thu Mar 15 14:37:35 CET 2012

Thanks for pointing me in the right direction.  I now have a great solution.

> library(GenomicRanges)
> system.time({
+ snplist<-with(snp, GRanges(CHR, IRanges(POS, POS)))
+ locations<-with(targets, GRanges(CHR, IRanges(START, STOP)))
+ olaps<-findOverlaps(snplist, locations)
+ })
   user  system elapsed 
   0.70    0.00    0.71
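
The findOverlaps() call only finds the hits; the matching rows still have to be pulled out afterwards. A minimal sketch on the toy data from the original post quoted below (called snps and region there; the timing above was on the real objects snp and targets), assuming a GenomicRanges version that provides queryHits():

```r
library(GenomicRanges)

# Toy data copied from the example in the quoted message
snps <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                   POS = c(295L, 640L, 670L, 100L, 1100L),
                   DAT = 1:5,
                   stringsAsFactors = FALSE)
region <- data.frame(CHR   = c("1", "1", "2", "2", "2", "X"),
                     START = c(10L, 210L, 430L, 650L, 810L, 1090L),
                     STOP  = c(100L, 350L, 630L, 675L, 850L, 1111L),
                     stringsAsFactors = FALSE)

# Each SNP becomes a width-1 range; each region becomes a range
snplist   <- with(snps, GRanges(CHR, IRanges(POS, POS)))
locations <- with(region, GRanges(CHR, IRanges(START, STOP)))

# May warn that some chromosomes ("3" here) appear in only one object
olaps <- findOverlaps(snplist, locations)

# Keep each SNP row that hit at least one region, in original row order
Res <- snps[sort(unique(queryHits(olaps))), ]
Res
```

unique() matters when a SNP falls inside more than one region; without it that row would be repeated.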

Brian

-----Original Message-----
From: David Winsemius [mailto:dwinsemius at comcast.net] 
Sent: Wednesday, March 14, 2012 2:44 PM
To: Davis, Brian
Cc: r-help at R-project.org
Subject: Re: [R] Needing a better solution to a lookup problem.

On Mar 14, 2012, at 3:27 PM, Davis, Brian wrote:

> I have a solution (actually a few) to this problem, but none are 
> computationally efficient enough to be useful.  I'm hoping someone can 
> enlighten me to a better solution.
>
> I have a data frame of chromosome/position pairs (along with other data 
> for the location).  For each pair I need to determine if it is within 
> a given data frame of ranges.  I need to keep only the pairs that are 
> within any of the ranges for further processing.
>
> Example:
> snps<-NULL
> snps$CHR<-c("1","2","2","3","X")
> snps$POS<-as.integer(c(295,640,670,100,1100))
> snps$DAT<-seq_along(snps$CHR)
> snps<-as.data.frame(snps, stringsAsFactors=FALSE)
>
> snps
>  CHR  POS DAT
> 1   1  295   1
> 2   2  640   2
> 3   2  670   3
> 4   3  100   4
> 5   X 1100   5
>
> region<-NULL
> region$CHR<-c("1","1","2","2","2","X")
> region$START<-as.integer(c(10,210,430,650,810,1090))
> region$STOP<-as.integer(c(100,350,630,675,850,1111))
> region<-as.data.frame(region, stringsAsFactors=FALSE)
>
> region
>  CHR START STOP
> 1   1    10  100
> 2   1   210  350
> 3   2   430  630
> 4   2   650  675
> 5   2   810  850
> 6   X  1090 1111
>
>
> The result I need would look like
>
> Res
>
> CHR  POS DAT
>   1  295   1
>   2  670   3
>   X 1100   5
>
>
> I have a solution that works reasonably well on small sets, but my 
> current data set is ~100K snp entries, and my regions table has ~200K 
> entries. I have ~1500 files to go through.
>
> I haven't found a good way to efficiently solve this problem.  I've 
> tried various versions of mapply/lapply, for loops, etc., which get the 
> answer for small sets but take hours (per file) on my real data.  
> Bioconductor seemed like the obvious place to look, but my GoogleFu 
> must not be that great.  I never found anything relevant.
>
> Any ideas or pointers in the right direction would be greatly 
> appreciated.

The usual BioC recommendation for this sort of problem is the IRanges package. The Bioconductor mailing list probably has many readers who have used that package, unlike this mailing list.

It purports to handle overlapping ranges as well as the non-overlapping problem you pose.
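
For the strictly non-overlapping case, base R's findInterval() can do the same lookup without Bioconductor. This is only a rough sketch and not the IRanges approach: the helper name in_region() is made up for illustration, and it assumes the regions within each chromosome do not overlap one another.

```r
# Toy data copied from the example in the quoted message
snps <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                   POS = c(295L, 640L, 670L, 100L, 1100L),
                   DAT = 1:5,
                   stringsAsFactors = FALSE)
region <- data.frame(CHR   = c("1", "1", "2", "2", "2", "X"),
                     START = c(10L, 210L, 430L, 650L, 810L, 1090L),
                     STOP  = c(100L, 350L, 630L, 675L, 850L, 1111L),
                     stringsAsFactors = FALSE)

# Hypothetical helper: TRUE for each SNP that lies inside some region.
# Assumes regions on the same chromosome do not overlap.
in_region <- function(snps, region) {
  keep <- logical(nrow(snps))
  for (chr in intersect(unique(snps$CHR), unique(region$CHR))) {
    r <- region[region$CHR == chr, ]
    r <- r[order(r$START), ]            # findInterval() needs sorted breaks
    i <- which(snps$CHR == chr)
    pos <- snps$POS[i]
    idx <- findInterval(pos, r$START)   # last region starting at or before pos
    hit <- idx > 0                      # 0 means pos precedes every region
    hit[hit] <- pos[hit] <= r$STOP[idx[hit]]
    keep[i] <- hit
  }
  keep
}

Res <- snps[in_region(snps, region), ]
Res
```

The loop runs once per chromosome rather than once per SNP, and findInterval() does a binary search within each chromosome, so it should scale far better than row-by-row lapply() calls.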

http://www.googlesyndicatedsearch.com/u/newcastlemaths?q=+chromosome+position+iranges&sa=Google+Search

-- 

David Winsemius, MD
West Hartford, CT