[BioC] How to reduce two GRanges simutaneously

Hervé Pagès hpages at fhcrc.org
Wed Apr 9 01:11:25 CEST 2014


mmmh... fast but incorrect :-/

   > x <- GRanges("chr1", IRanges(1:3, width=1))
   > y <- GRanges("chr1", IRanges(c(10,20,10), width=1))
   > reduceTwoGRanges(x, y)
   [[1]]
   GRanges with 2 ranges and 0 metadata columns:
         seqnames    ranges strand
            <Rle> <IRanges>  <Rle>
     [1]     chr1    [1, 3]      *
     [2]     chr1    [2, 2]      *
     ---
     seqlengths:
      chr1
        NA

   [[2]]
   GRanges with 2 ranges and 0 metadata columns:
         seqnames    ranges strand
            <Rle> <IRanges>  <Rle>
     [1]     chr1  [10, 10]      *
     [2]     chr1  [20, 20]      *
     ---
     seqlengths:
      chr1
        NA

Michael's approach based on detection of the connected components
of the graph of overlapping pairs is the way to go. Note that if
you want to stay close to the semantic of reduce(), you need to
call findOverlaps() with 'maxgap=1' though, so adjacent ranges are
considered to overlap. (The fact that we have to use 'maxgap=1'
for this is confusing because when ranges are adjacent the gap
between them is 0 not 1, but that's another story.)

Cheers,
H.

On 04/08/2014 12:55 PM, Niu, Liang (NIH/NIEHS) [E] wrote:
> Thanks!
> ________________________________________
> From: Hervé Pagès [hpages at fhcrc.org]
> Sent: Tuesday, April 08, 2014 3:42 PM
> To: Michael Lawrence; Niu, Liang (NIH/NIEHS) [E]
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] How to reduce two GRanges simutaneously
>
> Hi Liang, Michael,
>
> Yeah, an interesting problem indeed.
>
> Here is my take on it:
>
>     # Reduce 'x' and 'y' simultaneously. Returns a list of 2 parallel
>     # GRanges, the 1st one is reduced 'x' and the 2nd one reduced 'y'.
>     reduceTwoGRanges <- function(x, y)
>     {
>       if (length(x) != length(y))
>         stop("'x' and 'y' must have the same length")
>       rx_revmap <- mcols(reduce(x, with.revmap=TRUE))$revmap
>       ry_revmap <- mcols(reduce(y, with.revmap=TRUE))$revmap
>       rx_map <- integer(length(x))
>       rx_map[unlist(rx_revmap, use.names=FALSE)] <- togroup(rx_revmap)
>       ry_map <- integer(length(y))
>       ry_map[unlist(ry_revmap, use.names=FALSE)] <- togroup(ry_revmap)
>       sm <- IRanges:::selfmatchIntegerPairs(rx_map, ry_map)
>       ans1 <- unlist(range(split(x, sm)), use.names=FALSE)
>       ans2 <- unlist(range(split(y, sm)), use.names=FALSE)
>       list(ans1, ans2)
>     }
>
> Cheers,
> H.
>
> On 04/08/2014 05:37 AM, Michael Lawrence wrote:
>> This has got to be one of the most complex problems posed here.
>>
>> Here is one approach, but it may be completely wrong:
>>
>> hits_x <- findOverlaps(x)
>> hits_y <- findOverlaps(y)
>> hits_both <- intersect(hits_x, hits_y)
>>
>> I think the constraint of mutual overlap in x and y is now satisfied. In
>> order to reduce, we have to think of the Hits object as a graph, where
>> the connected components are those ranges that need to be reduced.
>>
>> g <- graph::graphNEL(as.character(seq_len(length(x))),
>> split(subjectHits(hits_both), queryHits(hits_both)))
>> comp <- unname(RBGL::connectedComp(g)) # unname just for efficiency
>>
>> That graphNEL constructor might kill your performance? There are other
>> ways to build the graph. Now we can form GRangesLists from the connected
>> components, and since we know they all overlap, just get the range, and
>> unlist them back to GRanges:
>>
>> ids <- as.integer(unlist(comp))
>> reduced_x <- unlist(range(relist(x[ids], comp)))
>> reduced_y <- unlist(range(relist(y[ids], comp)))
>>
>> And combine the result into a DataFrame with the range counts:
>>
>> answer <- DataFrame(x=reduced_x, y=reduced_y, count=elementLengths(comp))
>>
>> Hopefully that gets you closer. This hasn't been tested at all.
>>
>> Michael
>>
>>
>>
>> On Mon, Apr 7, 2014 at 11:40 AM, Niu, Liang (NIH/NIEHS) [E]
>> <liang.niu at nih.gov <mailto:liang.niu at nih.gov>> wrote:
>>
>>      Dear Herve,
>>
>>      I have a question: suppose that I have two GRanges of the same
>>      length, say x and y, where x[i] and y[i] are two genomic ranges that
>>      are paired. What I want to do is to reduce x and y simultaneously,
>>      i.e., combine (x[i],y[i]) and (x[j],y[j]) when x[i] overlaps with
>>      x[j] and y[i] overlaps with y[j]. Ideally, I would like an output in
>>      the following format:
>>
>>      range_x            range_y            count
>>      chr?:???-???     chr?:???-???      ?
>>
>>      where the range_x and range_y are the reduced ranges, and count is
>>      the number of records associated with the chr?:???-???  and
>>        chr?:???-???. How can I do this?
>>
>>      Thanks!
>>
>>      Liang
>>
>>      _______________________________________________
>>      Bioconductor mailing list
>>      Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>      https://stat.ethz.ch/mailman/listinfo/bioconductor
>>      Search the archives:
>>      http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list