[BioC] Using precede()/follow() to find two ranges

Thu Aug 22 16:12:41 CEST 2013

 >-----Original Message-----
 >From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
 >Sent: Wednesday, August 21, 2013 11:58 PM
 >To: Cook, Malcolm
 >Cc: 'd r'; 'Steve Lianoglou'; 'bioconductor at r-project.org list'
 >Subject: Re: [BioC] Using precede()/follow() to find two ranges
 >
 >On 08/21/2013 01:46 PM, Cook, Malcolm wrote:
 >> Well, there are many ways to skin this kind of cat (apologies to the internet).
 >>
 >> Indeed, there are many cats that are slight variations on this cat.
 >>
 >> re: "since I'm looking for regulatory elements I am not interested in this type of associations anyhow"
 >>
 >> I think that you should probably be handling this else-how.  If what you mean is that there should be some minimal distance
 >between your cgs and your genes, then, you should probably explicitly make this a requirement, and not add it as an afterthought
 >when faced with the possibilities.
 >>
 >> That said, if you are wed to the exact to the exact results you specified, and you want explicitly to exclude overlaps with your cgs, I
 >think you will find that you will want to compose something along these lines:
 >>
 >>
 >> 1)      use precede and follow each twice to generate your two candidate neighbors
 >
 >Not sure if this is what Malcolm meant, or if I'm being a little simplistic, but
 >maybe with
 >
 >     subj = GRanges("A", IRanges(seq(10, 50, 10), width=1), "+")
 >     query = GRanges("A", IRanges(seq(31, 51, 10), width=1), "+")
 >
 >The first range that query follows is
 >
 >     f1 = follow(query, subj)
 >
 >and the second  is
 >
 >     f2 = follow(subj[f1], subj)'

I think this _is_ being a little simplistic.  
The complexities arise when/if ranges in subj overlap each other.  
In such cases, only the closest range to the query will be picked up, and the subsequent overlapping range will be skipped.
This is the consequence of the fact that 'Overlapping ranges are excluded.' (c.f. ?precede).
This approach is essentially step 1 of my two step outline.  Step 2 is intended to adjust for the case of overlapping ranges in subj.

>giving the indexes into subj as
 >
 >     > f1
 >     [1] 3 4 5
 >     > f2
 >     [1] 2 3 4
 >
 >I guess there are problems when there is no following range (so need to check
 >for NA's before doing the second follow?
 >
 >Another strategy might simply
 >
 >     sort(subj)
 >
 >and then f2 = f1 - 1 (except when seqnames(subj)[f1] != seqnames(subj)[f2]).

...except also when subj[f1] overlaps subj[f1-1] 
but I like the insight.

>Probably strand will confusing things!

Precede and follow optionally take strand into account with ignore.strand.

>Martin
 >
 >>
 >> 2)      use the range or these candidates and search using findOverlaps within them to see if there are others that were skipped

Actually implementing this is tricky.  I really suggest you think exactly what your requirements are regarding proximity between query and subject, and overlap between subjects before you go down this route.

~Malcolm

 >>
 >> Cheers,
 >>
 >> Malcolm
 >>
 >> From: d r [mailto:dolevrahat at gmail.com]
 >> Sent: Wednesday, August 21, 2013 2:57 PM
 >> To: Cook, Malcolm
 >> Cc: Steve Lianoglou; bioconductor at r-project.org list
 >> Subject: Re: [BioC] Using precede()/follow() to find two ranges
 >>
 >> Hi Malcolm
 >> Thanks for your thoughtfulness.
 >> For the specific problem I'm working on I intend to consider only the TSSs of the genes, which I will compare against illumina 450k
 >probes, which are also single nucleotide ranges.
 >> Therefore, I do not think that overlapping ranges should be an issue here. Of course, it is theoretically possible that a probe may
 >overlap a TSS, but since I'm looking for regulatory elements I am not interested in this type of associations anyhow.
 >> So, any insight you may be able to offer on this problem will be highly appreciated.
 >>
 >> However, I think that for my particular problem overlapping ranges should not be an issue.
 >>
 >> On Wed, Aug 21, 2013 at 8:59 PM, Cook, Malcolm <MEC at stowers.org<mailto:MEC at stowers.org>> wrote:
 >> Dolev,
 >>
 >> Before chiming in on the problem as it is currently framed, I want to make sure of something.
 >>
 >> The documentation for precede and follow read 'Overlapping ranges are excluded.'.
 >>
 >> For instance if one of your cg ranges were to overlap one of your gene ranges, any method based on precede/follow will not
 >discover this association.
 >>
 >> Is this really desirable for your application?
 >>
 >> -Malcolm
 >>
 >>
 >>   >-----Original Message-----
 >>   >From: bioconductor-bounces at r-project.org<mailto:bioconductor-bounces at r-project.org> [mailto:bioconductor-bounces at r-
 >project.org<mailto:bioconductor-bounces at r-project.org>] On Behalf Of d r
 >>   >Sent: Wednesday, August 21, 2013 12:15 PM
 >>   >To: Steve Lianoglou
 >>   >Cc: bioconductor at r-project.org<mailto:bioconductor at r-project.org> list
 >>   >Subject: Re: [BioC] Using precede()/follow() to find two ranges
 >>   >
 >>   >Hi Steve and all
 >>   >
 >>   >Thanks for your suggestion. I was in fact thinking in this direction.
 >>   >However, by completly discarding the first set of hits before querying the
 >>   >second time I run the risk of missing ranges that may be potential second
 >>   >hits beacuse they were discarded after being a first hit for other ranges.
 >>   >For examle, if I have these two GRanges:
 >>   >
 >>   >gr1:
 >>   >
 >>   >GRanges with 2 ranges and 1 metadata column:
 >>   >
 >>   >       seqnames                 ranges strand |      names
 >>   >
 >>   >          <Rle>              <IRanges>  <Rle> |   <factor>
 >>   >
 >>   >   [1]     chr6 [125284212, 125284212]      * | cg00991794
 >>   >
 >>   >   [2]     chr6 [150465049, 150465049]      * | cg02250071
 >>   >
 >>   >
 >>   >
 >>   >
 >>   >
 >>   >gr2:
 >>   >
 >>   >GRanges with 4 ranges and 1 metadata column:
 >>   >
 >>   >       seqnames                 ranges strand |      gene
 >>   >
 >>   >          <Rle>              <IRanges>  <Rle> |  <factor>
 >>   >
 >>   >   [1]     chr6 [126284212, 124284212]      + |      PGM3
 >>   >
 >>   >   [2]     chr6 [160920998, 150920998]      + |   PLEKHG1
 >>   >
 >>   >   [3]     chr6 [ 83903012,  83903012]      + |      PGM3
 >>   >
 >>   >   [4]     chr6 [190102159, 170102159]       + |    WDR27
 >>   >
 >>   >The first hits will be gr2[1] and gr2[2], which will both be discarded.
 >>   >
 >>   >Now if I call precede() again the hits I will get will be gr2[3] and
 >>   >gr[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for
 >>   >gr1[1].
 >>   >
 >>   >
 >>   >I was thinking that applying a function that will do what Steve
 >>   >suggested might do the trick if it can run on one range of gr1 at a
 >>   >time without modifying gr2
 >>   >
 >>   >something along the lines of:
 >>   >
 >>   >two_hits_apply<-function(gr1,gr2)
 >>   >{
 >>   >
 >>   >p1<-precede(gr1,gr2)
 >>   >
 >>   >gr2.less<-gr2[-p1]
 >>   >
 >>   >p2<-precede(gr1,gr2.less)
 >>   >
 >>   >hits<-c(p1,p2)
 >>   >
 >>   >hits
 >>   >
 >>   >}
 >>   >
 >>   >now all I need is a way to apply a functuon over two GRanegs object.
 >>   >If I get it right, I will need somehting like mapply(), but that can
 >>   >actuaaly work on GRanges.
 >>   >
 >>   >Is such a function exists, or alternativly, is there a way to do this
 >>   >with the convential apply functions that I miss?
 >>   >
 >>   >Many thanks in advance
 >>   >Dolev
 >>   >
 >>   >
 >>   >
 >>   >
 >>   >
 >>   >
 >>   >
 >>   >On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou
 >>   ><lianoglou.steve at gene.com<mailto:lianoglou.steve at gene.com>>wrote:
 >>   >
 >>   >> Hi,
 >>   >>
 >>   >> On Wed, Aug 21, 2013 at 7:25 AM, d r <dolevrahat at gmail.com<mailto:dolevrahat at gmail.com>> wrote:
 >>   >> > Hello
 >>   >> >
 >>   >> > I am looking for a way to find the two preceiding/folliwng ranges in one
 >>   >> > GRanges object to each range in a second GRanges object.
 >>   >> >
 >>   >> > In other words, I am looking for a variation on precede() or follow()
 >>   >> that
 >>   >> > will return for each range in x two ranges in subject instead of one: the
 >>   >> > nearest preceidng(following) range and the second nearest preceding
 >>   >> > (following) range.
 >>   >> >
 >>   >> > Is there a way to do this? (sorry for not giving any more constructive
 >>   >> > suggestions, I know relativily  little on how GRanges works any therefore
 >>   >> > am at a loss as to how to proceed)
 >>   >>
 >>   >> A first draft way of doing that would be to simply use the results
 >>   >> from the first precede to remove those elements from the second call?
 >>   >>
 >>   >> For instance, if you have two ranges you are querying, gr1 and gr2:
 >>   >>
 >>   >> To get the immediately preceding ranges:
 >>   >>
 >>   >> R> p1 <- precede(gr1, gr2)
 >>   >>
 >>   >> Then to get the ones immediately after that:
 >>   >>
 >>   >> R> gr2.less <- gr2[-p1]
 >>   >> R> p2 <- precede(gr1, gr2.less)
 >>   >>
 >>   >> Then you can see who is who with `gr2[p1]` and the who-who is who with
 >>   >> `gr2.less[p2]` ... that should get you pretty close -- will likely
 >>   >> have to handle edge cases, for instance when there are no preceding
 >>   >> ranges, I think you get an NA (if I recall) so think about what you
 >>   >> want to do with those.
 >>   >>
 >>   >> -steve
 >>   >>
 >>   >> --
 >>   >> Steve Lianoglou
 >>   >> Computational Biologist
 >>   >> Bioinformatics and Computational Biology
 >>   >> Genentech
 >>   >>
 >>   >
 >>   >      [[alternative HTML version deleted]]
 >>   >
 >>   >_______________________________________________
 >>   >Bioconductor mailing list
 >>   >Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
 >>   >https://stat.ethz.ch/mailman/listinfo/bioconductor
 >>   >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
 >>
 >>
 >> 	[[alternative HTML version deleted]]
 >>
 >> _______________________________________________
 >> Bioconductor mailing list
 >> Bioconductor at r-project.org
 >> https://stat.ethz.ch/mailman/listinfo/bioconductor
 >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
 >>
 >
 >
 >--
 >Computational Biology / Fred Hutchinson Cancer Research Center
 >1100 Fairview Ave. N.
 >PO Box 19024 Seattle, WA 98109
 >
 >Location: Arnold Building M1 B861
 >Phone: (206) 667-2793