[BioC] A code kata: evening-out *Ranges ends.

Tue Aug 24 23:47:58 CEST 2010

Hi,

Preamble: This is a low priority question. It's basically something
I'm just asking to see how one might code it more cleanly.

One of Martin's responses to a question today suggested that the
person use IRanges::slice to perform a specific task:

http://thread.gmane.org/gmane.science.biology.informatics.conductor/30550/focus=30554

I was quite pleasantly surprised from that post since as one step in
cleaning some of my short-read input data. I haven't stumbled on that
before, and I essentially had written up a small piece of code that
does the same thing that the IRanges::slice operation does, but using
normal R operations on a vector-converted Rle object.

Using IRanges::slice is much nicer, and no doubt more efficient
(although my implementation wasn't terribly inefficient ... wonderful.

That's what's prompting me to ask this question.

I have another small piece of code that further cleans my data, but
this one IS terribly inefficient (it has a for loop). It runs over a
(large) I/GRanges object (representing short reads) and sets a
"cluster" of reads to have the same start/end. I'll show you what I
mean with example data and code below.

As some background, my data is coming from a high-throughput SAGE-like
protocol, where I get one very specific read/tag per transcript. As a
result of some variability in enzyme digestion (either "naturally
occurring" in the cell, or as a result of the protocol itself), the
resulting reads are one or two bp's off of what they should be.

Here's some fake data:

  reads <- GRanges(seqnames='chr1', strand='+',
                ranges=IRanges(start=c(sample(18:22, 10, replace=TRUE),
                                 sample(100:108, 10, replace=TRUE)),
                  width=sample(18:20, 20, replace=TRUE)))

You can see there are two "islands" of reads there, at around
positions 20-40, and 100-120.
What I want to do is to "fix" the reads in each island to have the
same start/end position.

Here's code I have to do that (forget any "strand" issues here --
those are already taken care of):

  fence.posts <- reduce(reads)
  o <- findOverlaps(fence.posts, reads)
  subjects <- subjectHits(o)
  query <- queryHits(o)

  for (idx in unique(query)) {
    adjust <- subjects[query == idx]
    start(reads[adjust]) <- start(fence.posts[idx])
    end(reads[adjust]) <- end(fence.posts[idx])
  }

It works, but it's slow.

So, I'm curious if there is a more clever/idiomatic way to do this
using the IRanges infrastructure while at the same time being more
efficient.

Thanks,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact