[BioC] Flagging overlapping gene exons (was Re: Question about CSAMA10 ...)

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed Sep 29 06:56:00 CEST 2010


On Tue, Sep 28, 2010 at 12:28 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 09/28/2010 08:09 AM, Paul Geeleher wrote:
>> I guess my next question is if there is a convenient way of filtering
>> out the part of the gene
>> that overlaps another one using GenomicRanges?

I've been meaning to do something that basically performs this core
functionality for a while. I just finished putting together a first
cut of it and added it to my GenomicFeatures extensions package:


I'm also not an IRanges guru, so I'm sure there are
better/more-magical ways to do this. I'll need to test this more
thoroughly, but it currently looks to do the right thing for the
example provided below.

The context I'm using it for is to create an "annotated" GRanges
object (1 per chromosome) that contains annotated intervals that (i)
span the length of the chromosome; and (ii) are annotated with exon
information for parts (or all) of exons that exclusively belong to a
given gene.

I think some example code + results will speak better about what I
mean than words can.

Note that if you have GenomicRanges installed and the "testthat"
library, you can easily play along by sourcing urls straight from
github, so let's set up the R environment

R> library(testthat)
R> library(GenomicRanges)

I have a list of GRanges objects (you can use a GRangesList if you
like) with exon annotations for genes. `lchr1` is an example object
for a pseudo "chr1" that only has a few (4) gene annotations on it and
is 10,000 bp long.

GeneA and GeneC have some exons that overlap, and will become disjoint later.
GeneD will cause further "fractionation" if we toss out strand info:

R> source("http://github.com/lianos/GenomicFeaturesX/raw/master/inst/tests/test-annotateChromosome.R")
R> lchr1

GRanges with 4 ranges and 1 elementMetadata value
    seqnames     ranges strand |   exon.anno
       <Rle>  <IRanges>  <Rle> | <character>
[1]     chr1 [ 10,  30]      + |        utr5
[2]     chr1 [ 40,  60]      + |         cds
[3]     chr1 [ 80, 100]      + |         cds
[4]     chr1 [120, 140]      + |        utr3

I've written a function `annotateChromosome` that will calculate the
ranges along a chromosome that exclusive belong to a gene as explained
above, and assigns each interval the appropriate exon annotation.

R> source("http://github.com/lianos/GenomicFeaturesX/raw/master/R/annotateGenome.R")
R> annotateChromosome(lchr1, 'chr1', 1000, stranded=TRUE)
GRanges with 14 ranges and 2 elementMetadata values
     seqnames     ranges strand |   exon.anno      symbol
        <Rle>  <IRanges>  <Rle> | <character> <character>
 [1]     chr1 [ 10,  30]      + |        utr5       GeneA
 [2]     chr1 [ 15,  45]      - |         cds       GeneD
 [3]     chr1 [ 40,  49]      + |         cds       GeneA
 [4]     chr1 [ 50, 100]      - |        utr5       GeneD
 [5]     chr1 [ 50,  60]      + |     overlap          NA
 [6]     chr1 [ 61,  79]      + |         cds       GeneC
 [7]     chr1 [ 80,  85]      + |     overlap          NA
 [8]     chr1 [ 86, 100]      + |         cds       GeneA
 [9]     chr1 [105, 115]      + |         cds       GeneC
[10]     chr1 [120, 140]      + |        utr3       GeneA
[11]     chr1 [200, 230]      + |        utr5       GeneB
[12]     chr1 [250, 280]      + |         cds       GeneB
[13]     chr1 [300, 330]      + |         cds       GeneB
[14]     chr1 [400, 430]      + |        utr3       GeneB

You can try setting `stranded=FALSE` to see how that changes the results.

If you look at the function here:

The lapply block that generates the "interval.annos" object (~ line
34) does the bookkeeping involved in splitting apart overlapping
regions and keeping only the parts of exons that are unique to each
gene (look around the call to tabulate(...)).

The lapply block that generates "clean.list" (~ line 51) goes back to
the original annotated gene list and picks out the appropriate
annotations for each now-disjoint exon.

That's about it.

As I said earlier, this is a first cut at doing this. I'm sure it can
be cleaner and done more efficiently, but I thought you'd be
interested in the code that does the disjoining stuff (in the context
of this function, or just by itself).

It's only about 20 lines of code, so ... you might not have to do a
rewrite in Python after all :-) [not that I have anything against
Python .. I rather like it]


ps - the content/validity of those github links will (obviously)
change over time. In case they change before you get a change to play
with the contents of this message, you can get the code @ the time of
this commit from these links:

annotateChromosome function:

code to generate the example lchr1 object:

Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

More information about the Bioconductor mailing list