[BioC] A story about IRanges::GappedRanges ... how does it end?

Wed Oct 27 16:09:51 CEST 2010

Howdy,

Thanks for taking the time to read through my musings ... this email
is also a bit long, but I'm just trying to be descriptive enough to
get my point across.

On Mon, Oct 25, 2010 at 10:45 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
<snip>
> It's still supported. I'm guessing it hasn't changed due to lack of use.
> Therefore, no feature or bug requests. Could you please elaborate on why
> GappedRanges is more convenient than GRangesList?

Honestly, it just *feels* right :-)

I want to represent two distances in transcriptome space using their
coordinates in genomic space.

Some of the data we are generating is high throughput SAGE sequencing
data, so I'm developing a package to work with it (soup to nuts)[1]

I wanted to use the GappedRanges for the SAGEseq protocol. Without
getting into too many "boring" details of the protocol, I have two
positions on a transcript (indicating where restriction sites are).
The upstream position is defined by the location of the tag that was
sequenced (the first 4 bp of the tag is the restriction site + 17 bp
of transcript sequence). The downstream position is defined by the
where the position of the next 3' restriction site is. The downstream
3' restriction site on the transcript isn't always the most proximal
3' restriction site in "genome space" since this site might occur in
an intron -- I want to skip these sites and only report sites in
exons.

These two restriction sites define my "fence posts". Does that make sense?

The data I have would be a GRanges object, with a GappedRanges object
as one of the columns of the GRanges elementMetadata slot. A
completely contrived example would look like so:

R> library(GenomicRanges)
R> gr <- GRanges(seqnames='chr1', IRanges(c(1, 100), width=21), strand='+')
R> .gaps <- as(IRangesList(IRanges(c(1, 50, 80), width=c(30, 10, 15)),
                        IRanges(c(100, 150, 180), width=c(30, 10, 15))),
               'GappedRanges')
R> values(gr) <- DataFrame(fence.posts=.gaps)

GRanges with 2 ranges and 1 elementMetadata value
    seqnames     ranges strand |    fence.posts
       <Rle>  <IRanges>  <Rle> | <GappedRanges>
[1]     chr1 [  1,  21]      + |     [  1,  94]
[2]     chr1 [100, 120]      + |     [100, 194]

Positions 94 and 194 represent the next possible position of my
restriction site while 'honoring' exon/transcript structure. Having my
data this way, I can also quickly look at the exon structure present
in the transcript that my "real" tag (the one I sequenced) landed in.

With some trivial-to-define functions over GappedRanges objects:
http://github.com/lianos/GenomicFeaturesX/blob/master/R/GappedRanges-IRanges.R

I can then find out interesting things about my fence.posts. Like:

(i) the distance between them in transcriptome space (by 'minding the gaps'):
R> gwidth(values(gr)$fence.posts, mind.the.gap=TRUE)
[1] 55 55

or genomic space:
R> gwidth(values(gr)$fence.posts, mind.the.gap=FALSE)
[1] 94 95

(ii) I can also use the info embedded in my gapped ranges to inquire
about the exon structure present between my fence posts:

R> ranges(values(gr)$fence.posts, mind.the.gap=TRUE)
CompressedNormalIRangesList of length 2
[[1]]
NormalIRanges of length 3
    start end width
[1]     1  30    30
[2]    50  59    10
[3]    80  94    15

[[2]]
NormalIRanges of length 3
    start end width
[1]   100 129    30
[2]   150 159    10
[3]   180 194    15

(ii) Or I can just ignore the embedded gap/exon structure:
IRanges of length 2
    start end width
[1]     1  94    94
[2]   100 194    95

So -- like I said, it just seems like a more natural fit to me than a
GRangesList here -- and it's a bit less heavy weight than a
GRangesList as I don't need strand/seqinfo information for my gaps
since they are already present in the GRanges themselves.

Hopefully that makes some sense?

-steve

[1] The package can also work with other tag-sequencing data. We're
also developing a "new" high throughput 3'RACE-like protocol to
sequence the very end of transcripts, and there is DeepCAGE (already
in the wild) data that I'd also like my package to be able to deal
with.

(in case anybody was interested :-)

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact