[BioC] Using precede()/follow() to find two ranges

Thu Aug 22 16:57:19 CEST 2013

Dolev, 

If your requirements should change to include a maximum radius (say, 1000) around your query within which you will consider subject hits, you might instead use an approach like:

ol.df <- within(as.data.frame(findOverlaps(resize(query, 1000, use.names=FALSE), subj)),
                d <- distance(query[queryHits], subj[subjectHits]))

Now use your favorite R approach to filter ol.df based on your other criteria.
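
As a rough illustration of the radius idea, here is a minimal sketch on made-up data. The toy coordinates and the names `radius` and `window` are assumptions for illustration only. Note that resize() anchors at the start of each range by default, so this sketch uses fix="center" to get a window that extends `radius` on both sides of the query; that is one possible reading of "radius around your query", not necessarily the intended one.

```r
library(GenomicRanges)

# Toy single-nucleotide ranges (hypothetical coordinates).
subj  <- GRanges("A", IRanges(c(100, 900, 1500, 5000), width = 1), strand = "*")
query <- GRanges("A", IRanges(c(1000, 4000), width = 1), strand = "*")

radius <- 1000L

# Widen each query symmetrically: width 2 * radius + 1, centered on the query.
window <- resize(query, width = 2L * radius + 1L, fix = "center",
                 use.names = FALSE)

# All subjects falling inside each window, including exact overlaps.
ol.df <- as.data.frame(findOverlaps(window, subj))

# Distance from the original (unwidened) query to each candidate subject.
ol.df$d <- distance(query[ol.df$queryHits], subj[ol.df$subjectHits])

# Example filter: order the candidates of each query from nearest to farthest.
ol.df <- ol.df[order(ol.df$queryHits, ol.df$d), ]
```

From here, picking the two nearest candidates per query, or dropping hits with d == 0, is ordinary data.frame filtering.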

Just sayin,

Malcolm

 >-----Original Message-----
 >From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Cook, Malcolm
 >Sent: Thursday, August 22, 2013 9:25 AM
 >To: 'd r'; 'Martin Morgan'
 >Cc: 'Steve Lianoglou'; 'bioconductor at r-project.org list'
 >Subject: Re: [BioC] Using precede()/follow() to find two ranges
 >
 >Dolev,
 >
 >If there are no overlaps within the ranges of your subjects then approaches involving precede and follow will of course not be
 >affected by overlaps between ranges in your subjects, since, well, there aren't any.
 >
 >Still, be careful with approaches using precede/follow alone, since any of your 1 nt wide query ranges which are themselves exactly
 >present in your 1 nt wide subject ranges won't be found using precede/follow.
 >
 >~Malcolm
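
A small sketch of the caveat above, on made-up coordinates (the data are assumptions for illustration): a query sitting at exactly the same position as a subject is skipped by both precede() and follow(), while findOverlaps() does report it.

```r
library(GenomicRanges)

# Three single-nucleotide subjects; the query coincides with the middle one.
subj  <- GRanges("A", IRanges(c(10, 20, 30), width = 1), strand = "+")
query <- GRanges("A", IRanges(20, width = 1), strand = "+")

p  <- precede(query, subj)       # subject the query precedes: index 3 (position 30)
f  <- follow(query, subj)        # subject the query follows:  index 1 (position 10)
ov <- findOverlaps(query, subj)  # the exact match: subject index 2 (position 20)

subjectHits(ov)
```

So if exact matches matter, they have to be collected separately with findOverlaps() and combined with the precede/follow results.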
 >
 >
 >From: d r [mailto:dolevrahat at gmail.com]
 >Sent: Thursday, August 22, 2013 2:40 AM
 >To: Martin Morgan
 >Cc: Cook, Malcolm; Steve Lianoglou; bioconductor at r-project.org list
 >Subject: Re: [BioC] Using precede()/follow() to find two ranges
 >
 >Thanks Martin, this sounds like a useful approach. I will try to implement it.
 >Malcolm - I am not sure that I understood your comments.
 >I do not "want explicitly to exclude overlaps with my cgs". Rather, given that my particular problem involves intersecting single
 >nucleotide ranges, I do not think that the problem of overlaps is likely to arise. Am I missing something here?
 >
 >Thanks again to all of you for being so helpful
 >Dolev
 >
 >On Thu, Aug 22, 2013 at 7:57 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
 >On 08/21/2013 01:46 PM, Cook, Malcolm wrote:
 >Well, there are many ways to skin this kind of cat (apologies to the internet).
 >
 >Indeed, there are many cats that are slight variations on this cat.
 >
 >re: "since I'm looking for regulatory elements I am not interested in this type of associations anyhow"
 >
 >I think that you should probably be handling this else-how.  If what you mean is that there should be some minimal distance between
 >your cgs and your genes, then, you should probably explicitly make this a requirement, and not add it as an afterthought when faced
 >with the possibilities.
 >
 >That said, if you are wed to the exact results you specified, and you want explicitly to exclude overlaps with your cgs, I
 >think you will find that you will want to compose something along these lines:
 >
 >
 >1)      use precede and follow each twice to generate your two candidate neighbors
 >
 >Not sure if this is what Malcolm meant, or if I'm being a little simplistic, but maybe with
 >
 >    subj = GRanges("A", IRanges(seq(10, 50, 10), width=1), "+")
 >    query = GRanges("A", IRanges(seq(31, 51, 10), width=1), "+")
 >
 >The first range that query follows is
 >
 >    f1 = follow(query, subj)
 >
 >and the second  is
 >
 >    f2 = follow(subj[f1], subj)
 >
 >giving the indexes into subj as
 >
 >    > f1
 >    [1] 3 4 5
 >    > f2
 >    [1] 2 3 4
 >
 >I guess there are problems when there is no following range (so you need to check for NAs before doing the second follow?)
 >
 >Another strategy might simply be to
 >
 >    sort(subj)
 >
 >and then f2 = f1 - 1 (except when seqnames(subj)[f1] != seqnames(subj)[f2]). Strand will probably confuse things!
 >
 >Martin
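
The NA check mentioned above could look roughly like the following sketch. The query positions are changed from the example so that some queries have no first or second hit; the names `ok` and the pre-filled `f2` are assumptions for illustration, not part of the original suggestion.

```r
library(GenomicRanges)

subj  <- GRanges("A", IRanges(seq(10, 50, 10), width = 1), strand = "+")
# Position 5 has nothing upstream; position 11 has only one subject upstream.
query <- GRanges("A", IRanges(c(5, 11, 31), width = 1), strand = "+")

# First hit: the subject each query follows (NA when there is none).
f1 <- follow(query, subj)

# Second hit: the subject that the first hit itself follows.
# Only look it up where a first hit exists; otherwise leave NA.
f2 <- rep(NA_integer_, length(f1))
ok <- !is.na(f1)
f2[ok] <- follow(subj[f1[ok]], subj)

cbind(f1, f2)
```

This relies on the subject ranges not overlapping each other, since follow() skips overlapping ranges when chaining from the first hit to the second.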
 >
 >2)      use the range of these candidates and search using findOverlaps within them to see if there are others that were skipped
 >
 >Cheers,
 >
 >Malcolm
 >
 >From: d r [mailto:dolevrahat at gmail.com]
 >Sent: Wednesday, August 21, 2013 2:57 PM
 >To: Cook, Malcolm
 >Cc: Steve Lianoglou; bioconductor at r-project.org list
 >Subject: Re: [BioC] Using precede()/follow() to find two ranges
 >
 >Hi Malcolm
 >Thanks for your thoughtfulness.
 >For the specific problem I'm working on I intend to consider only the TSSs of the genes, which I will compare against Illumina 450k
 >probes, which are also single nucleotide ranges.
 >Therefore, I do not think that overlapping ranges should be an issue here. Of course, it is theoretically possible that a probe may
 >overlap a TSS, but since I'm looking for regulatory elements I am not interested in this type of association anyhow.
 >So, any insight you may be able to offer on this problem will be highly appreciated.
 >
 >However, I think that for my particular problem overlapping ranges should not be an issue.
 >On Wed, Aug 21, 2013 at 8:59 PM, Cook, Malcolm <MEC at stowers.org> wrote:
 >Dolev,
 >
 >Before chiming in on the problem as it is currently framed, I want to make sure of something.
 >
 >The documentation for precede and follow reads 'Overlapping ranges are excluded.'.
 >
 >For instance if one of your cg ranges were to overlap one of your gene ranges, any method based on precede/follow will not discover
 >this association.
 >
 >Is this really desirable for your application?
 >
 >-Malcolm
 >
 >
 >  >-----Original Message-----
 >  >From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of d r
 >  >Sent: Wednesday, August 21, 2013 12:15 PM
 >  >To: Steve Lianoglou
 >  >Cc: bioconductor at r-project.org list
 >  >Subject: Re: [BioC] Using precede()/follow() to find two ranges
 >  >
 >  >Hi Steve and all
 >  >
 >  >Thanks for your suggestion. I was in fact thinking in this direction.
 >  >However, by completely discarding the first set of hits before querying the
 >  >second time I run the risk of missing ranges that may be potential second
 >  >hits because they were discarded after being a first hit for other ranges.
 >  >For example, if I have these two GRanges:
 >  >
 >  >gr1:
 >  >
 >  >GRanges with 2 ranges and 1 metadata column:
 >  >       seqnames                 ranges strand |      names
 >  >          <Rle>              <IRanges>  <Rle> |   <factor>
 >  >   [1]     chr6 [125284212, 125284212]      * | cg00991794
 >  >   [2]     chr6 [150465049, 150465049]      * | cg02250071
 >  >
 >  >gr2:
 >  >
 >  >GRanges with 4 ranges and 1 metadata column:
 >  >       seqnames                 ranges strand |      gene
 >  >          <Rle>              <IRanges>  <Rle> |  <factor>
 >  >   [1]     chr6 [126284212, 124284212]      + |      PGM3
 >  >   [2]     chr6 [160920998, 150920998]      + |   PLEKHG1
 >  >   [3]     chr6 [ 83903012,  83903012]      + |      PGM3
 >  >   [4]     chr6 [190102159, 170102159]      + |     WDR27
 >  >
 >  >
 >  >The first hits will be gr2[1] and gr2[2], which will both be discarded.
 >  >
 >  >Now if I call precede() again the hits I will get will be gr2[3] and
 >  >gr2[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for
 >  >gr1[1].
 >  >
 >  >
 >  >I was thinking that applying a function that will do what Steve
 >  >suggested might do the trick if it can run on one range of gr1 at a
 >  >time without modifying gr2
 >  >
 >  >something along the lines of:
 >  >
 >  >two_hits_apply <- function(gr1, gr2)
 >  >{
 >  >    p1 <- precede(gr1, gr2)
 >  >    gr2.less <- gr2[-p1]
 >  >    p2 <- precede(gr1, gr2.less)
 >  >    hits <- c(p1, p2)
 >  >    hits
 >  >}
 >  >
 >  >now all I need is a way to apply a function over two GRanges objects.
 >  >If I get it right, I will need something like mapply(), but that can
 >  >actually work on GRanges.
 >  >
 >  >Does such a function exist, or alternatively, is there a way to do this
 >  >with the conventional apply functions that I am missing?
 >  >
 >  >Many thanks in advance
 >  >Dolev
 >  >
 >  >
 >  >
 >  >
 >  >
 >  >
 >  >
 >  >On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou <lianoglou.steve at gene.com> wrote:
 >  >
 >  >> Hi,
 >
 >  >>
 >  >> On Wed, Aug 21, 2013 at 7:25 AM, d r <dolevrahat at gmail.com> wrote:
 >  >> > Hello
 >  >> >
 >  >> > I am looking for a way to find the two preceding/following ranges in one
 >  >> > GRanges object to each range in a second GRanges object.
 >  >> >
 >  >> > In other words, I am looking for a variation on precede() or follow() that
 >  >> > will return for each range in x two ranges in subject instead of one: the
 >  >> > nearest preceding (following) range and the second nearest preceding
 >  >> > (following) range.
 >  >> >
 >  >> > Is there a way to do this? (sorry for not giving any more constructive
 >  >> > suggestions, I know relatively little about how GRanges works and therefore
 >  >> > am at a loss as to how to proceed)
 >  >>
 >  >> A first draft way of doing that would be to simply use the results
 >  >> from the first precede to remove those elements from the second call?
 >  >>
 >  >> For instance, if you have two ranges you are querying, gr1 and gr2:
 >  >>
 >  >> To get the immediately preceding ranges:
 >  >>
 >  >> R> p1 <- precede(gr1, gr2)
 >  >>
 >  >> Then to get the ones immediately after that:
 >  >>
 >  >> R> gr2.less <- gr2[-p1]
 >  >> R> p2 <- precede(gr1, gr2.less)
 >  >>
 >  >> Then you can see who is who with `gr2[p1]` and the who-who is who with
 >  >> `gr2.less[p2]` ... that should get you pretty close -- will likely
 >  >> have to handle edge cases, for instance when there are no preceding
 >  >> ranges, I think you get an NA (if I recall) so think about what you
 >  >> want to do with those.
 >  >>
 >  >> -steve
 >  >>
 >  >> --
 >  >> Steve Lianoglou
 >  >> Computational Biologist
 >  >> Bioinformatics and Computational Biology
 >  >> Genentech
 >  >>
 >  >
 >  >
 >  >_______________________________________________
 >  >Bioconductor mailing list
 >  >Bioconductor at r-project.org
 >
 >  >https://stat.ethz.ch/mailman/listinfo/bioconductor
 >  >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
 >
 >
 >
 >
 >--
 >Computational Biology / Fred Hutchinson Cancer Research Center
 >1100 Fairview Ave. N.
 >PO Box 19024 Seattle, WA 98109
 >
 >Location: Arnold Building M1 B861
 >Phone: (206) 667-2793
 >
 >