[BioC] [somehow-OT] Storing/quickly accessing "genome length" data.

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed Feb 9 22:59:53 CET 2011


On Wed, Feb 9, 2011 at 4:25 PM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
>
>
> On Wed, Feb 9, 2011 at 1:08 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>>
>> Hi,
>>
>> I guess a lot of us have this problem: I'm storing "genome-length"
>> integer/double vectors, with one value for each position along each
>> chromosome.
>>
>> I want to quickly access parts of these vectors in a manner as
>> convenient and efficient as how we can quickly access the reads in
>> a given region of a BAM file. I'm curious what you folks are using to
>> store this type of info?
>>
>> Currently I just have RData objects of Rle's or XIntegers, etc. for
>> each strand of each chromosome. I'll load these data files, query the
>> info over the ranges I want, then junk the (usually large) vector I
>> just loaded. It's not the best, but it works.
>>
>> In the bioinformatics world, I guess these data are best stored as
>> bigWig files, yes? And AFAIK, there's no (convenient or otherwise) way
>> to query bigWigs from within R/Bioc, right?
>>
>
> Actually, rtracklayer can query bigWigs. It's very efficient.

Oh, I see ... sorry I missed that. I couldn't find info on it when
searching through rtracklayer's vignette for "bigwig." I missed the
BigWigSelection documentation.

And ... wow, I can create a bigWig via export.bw, nice. I'll have
to play with this a bit.
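
For reference, a minimal sketch of that round trip (write a bigWig with
export.bw, then read back only one region through a BigWigSelection).
It assumes a reasonably recent rtracklayer on a platform where bigWig
support is available; the toy "chr1", its scores, and the temporary
file are made up for illustration and are not from this thread.

```r
library(GenomicRanges)
library(rtracklayer)

# Hypothetical data: one score per base on a made-up 1 kb chromosome.
scores <- GRanges("chr1",
                  IRanges(start = 1:1000, width = 1),
                  score = as.numeric(1:1000))
seqlengths(scores) <- c(chr1 = 1000)  # the bigWig writer needs chromosome lengths

# Write the per-base scores to a temporary bigWig file.
bw_path <- tempfile(fileext = ".bw")
export.bw(scores, bw_path)

# Read back only the region of interest instead of the whole chromosome.
region <- GRanges("chr1", IRanges(start = 101, end = 110))
hits <- import.bw(bw_path, selection = BigWigSelection(region))

# 'hits' is a GRanges whose 'score' column holds the stored values.
print(hits)
```

The point of the selection is that only the requested ranges are read,
which replaces the "load the whole vector, query, then junk it" pattern
described above.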

>> Then I wonder if storing these in hdf/netcdf files isn't actually the
>> way to go  ... and if so, why not go whole-hog and work on a bioc
>> interface to the somehow-defined biohdf format?
>>
>> Any thoughts?
>>
>
> This is also a good idea, especially if you have data for many samples.

Yes, that. But I'm also thinking of one such file per genome "release"
I'm working with (things like conservation, mappability, etc. for
hg18, hg19, mm9, etc).

> There's a group of us here at Genentech looking to improve upon the netcdf4
> support in R.

Interesting. Is your work "out in the open", or an internal project?

> This is the first I've heard of biohdf. Sounds kind of
> half-baked though.

I also haven't found any updated information since whatever
document/webpage is up from last spring (March or April(?)). I reckon
it's being worked/improved on somewhere, though. Perhaps sticking with
the (more) standard netcdf4 is the right way to go, anyway.

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


