[BioC] IRanges package: findOverlaps on blobs

Steve Lianoglou mailinglist.honeypot at gmail.com
Tue May 31 18:05:00 CEST 2011


Hi,

On Tue, May 31, 2011 at 11:08 AM, Fahim Mohammad <fahim.md at gmail.com> wrote:
> Hello
> I have a file containing a list of aligned RefSeq identifiers in the
> following format. I want to convert this file into RangedData format in
> such a way that the "findOverlaps" function is as efficient as possible. I
> know how to do this when each identifier has just a start and an end
> coordinate; the IRanges package is very efficient at finding overlapping
> intervals in that case. But when the structure is in the form of blobs (as
> shown below in the last three fields), I am not sure how to convert it
> into RangedData format and how to subsequently call the "findOverlaps"
> function.
>
> RefSeqID        targetName  strand  blockSizes        queryStart       targetStart
> XM_001065892.1  chr4        +       127,986,          0,127,           124513961,124514706,
> XM_578205.2     chr2        -       535,137,148,      0,535,672,       155875533,155879894,155895543,
> NM_012543.2     chr1        +       506,411,212,494,  0,506,917,1129,  96173572,96174920,96176574,96177991,

I guess the answer lies in how you want overlaps to be calculated.

Is each "block" for each RefSeqID treated individually, or do you want
one overlap to count for all of them?

If the answer is the latter, you might consider putting the intervals
into a GRangesList object, where each element in the list is a GRanges
object that holds all the ranges for one particular RefSeq ID. If you
want every block treated individually, then you just have to parse the
whole file into a single GRanges object.
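
To make the distinction concrete, here is a small sketch in plain Python
(not R, and deliberately not the GenomicRanges API). The function names,
the ids and the coordinates are all made up for illustration, and the
intervals are assumed to be closed (both ends included). It only shows
how "one hit per block" differs from "one hit per RefSeq ID":

```python
# Plain-Python illustration of the two ways to count overlaps.
# Nothing here is Bioconductor code; all names and numbers are hypothetical.

def overlaps(a, b):
    """True if closed intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def per_block_hits(query, blocks_by_id):
    """Treat every block individually: one hit per overlapping block."""
    return [(tid, i)
            for tid, blocks in blocks_by_id.items()
            for i, blk in enumerate(blocks)
            if overlaps(query, blk)]

def per_group_hits(query, blocks_by_id):
    """Treat each id as one unit: at most one hit per id."""
    return [tid
            for tid, blocks in blocks_by_id.items()
            if any(overlaps(query, blk) for blk in blocks)]

# Made-up coordinates: tx1 has two blocks, tx2 has one.
blocks_by_id = {
    "tx1": [(100, 199), (300, 399)],
    "tx2": [(500, 599)],
}
query = (150, 350)

print(per_block_hits(query, blocks_by_id))  # [('tx1', 0), ('tx1', 1)]
print(per_group_hits(query, blocks_by_id))  # ['tx1']
```

The grouped version corresponds to the "one overlap counts for all of
them" case, the per-block version to the "each block individually" case.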

If you're asking *how* to parse the file, there are many ways. I'd
probably use read.table to get the file into its respective columns,
then iterate over the rows, converting each one into a GRanges object
that has as many ranges as the row has "blocks". Look at ?strsplit if
you don't know how to split the comma-separated fields.
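
The splitting step itself is simple. Below is a minimal sketch of it in
plain Python rather than R, so it is not the read.table/strsplit code
itself. It assumes each record sits on a single line with six
whitespace-separated fields (i.e. that the wrapping in the example above
is just an email artifact), and the names parse_int_list and parse_row
are hypothetical:

```python
def parse_int_list(field):
    """Turn a string like '127,986,' into [127, 986].

    The trailing comma produces an empty last piece, which is skipped.
    """
    return [int(piece) for piece in field.split(",") if piece.strip()]

def parse_row(line):
    """Split one whitespace-delimited record into typed fields."""
    refseq_id, target, strand, sizes, q_starts, t_starts = line.split()
    row = {
        "id": refseq_id,
        "chrom": target,
        "strand": strand,
        "block_sizes": parse_int_list(sizes),
        "query_starts": parse_int_list(q_starts),
        "target_starts": parse_int_list(t_starts),
    }
    # Every block needs a size, a query start and a target start.
    n = len(row["block_sizes"])
    if len(row["query_starts"]) != n or len(row["target_starts"]) != n:
        raise ValueError("inconsistent block counts for " + refseq_id)
    return row

# First record from the example, assumed to be on one line.
row = parse_row("XM_001065892.1 chr4 + 127,986, 0,127, 124513961,124514706,")
print(row["block_sizes"])    # [127, 986]
print(row["target_starts"])  # [124513961, 124514706]
```

Each parsed row then has one entry per block, which is the shape needed
to build one set of ranges per RefSeq ID.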

Lastly -- how are these ranges defined?
Is the first row supposed to be turned into:

chr4 (124513961) -- (124513961 + 0 + 127)
chr4 (124514706 + 127) -- (124514706 + 127 + 986)

or?
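
For comparison, here is the arithmetic for the first row under two
candidate readings, as a plain-Python sketch. The first function follows
the reading written out above. The second is only one possible
alternative (targetStart alone gives the block start); it is an
assumption, not something confirmed in this thread. Both function names
are hypothetical, and whether the end coordinate is inclusive is left
open:

```python
def blocks_offset_by_query(sizes, query_starts, target_starts):
    """Reading from the message above: shift each block by its queryStart."""
    return [(t + q, t + q + s)
            for s, q, t in zip(sizes, query_starts, target_starts)]

def blocks_from_target_only(sizes, target_starts):
    """Alternative reading (an assumption): targetStart is the block start."""
    return [(t, t + s) for s, t in zip(sizes, target_starts)]

# First row of the example: XM_001065892.1 on chr4.
sizes = [127, 986]
query_starts = [0, 127]
target_starts = [124513961, 124514706]

print(blocks_offset_by_query(sizes, query_starts, target_starts))
# [(124513961, 124514088), (124514833, 124515819)]
print(blocks_from_target_only(sizes, target_starts))
# [(124513961, 124514088), (124514706, 124515692)]
```

The two readings agree on the first block (its queryStart is 0) and
differ by 127 on the second, so which one is intended changes the
overlaps you will get.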

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


