[BioC] phastCon-scores

Tue Dec 15 23:35:55 CET 2009

On Tue, Dec 15, 2009 at 5:21 PM, Sean Davis <seandavi at gmail.com> wrote:
> On Tue, Dec 15, 2009 at 5:01 PM, Johannes Waage <johannes.waage at bric.dk> wrote:
>> Hi all,
>>
>> I have a small but important challenge that I've been unable to solve. I
>> need to aggregate all phastCons scores for 75-100 nt around all *mus* exon
>> splice sites. I've tried several approaches: downloading the entire
>> multiz30way phastCons dataset from UCSC (too big to work with smoothly),
>> intersecting via the UCSC Table Browser and Galaxy (which limits me to
>> 10 million data points, unfortunately), and fetching data through
>> rtracklayer (too slow). Can anyone point me towards an elegant and fast way
>> to fetch data points for many genomic intervals? With around 22k genes and
>> an average of 8 exons of 100 nt each, it seems I need to be able to fetch
>> around 20 million data points.
>>
>> I need to use the data as background for comparison with select upregulated
>> exons in an RNA-seq splicing study.
>
> Could you do this chromosome-by-chromosome by loading the per-base
> data one chromosome at a time from the files into an R vector and then
> using normal vector subsetting to get the regions of interest?

OK.  I looked at the files and I don't think it will work without some
cleverness.  The two methods below are still possible, though.

Sean
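
A sketch of what the "cleverness" for the per-chromosome vector idea might
involve. It assumes (the thread does not say so) that the per-chromosome
files are fixedStep wiggle text with step=1, span=1 and gaps between blocks,
so the values cannot be read straight into a vector and indexed by position.
The function names are made up for illustration.

```python
import math


def load_fixedstep(lines, chrom_length):
    """Expand fixedStep wiggle lines into a per-base list.

    Positions are 1-based, so index 0 is unused. Positions that fall in a
    gap between blocks stay NaN. Assumes span=1 (one value per position).
    """
    scores = [math.nan] * (chrom_length + 1)
    pos, step = None, 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("fixedStep"):
            # Header looks like: fixedStep chrom=chr1 start=3000001 step=1
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            pos = int(fields["start"])
            step = int(fields.get("step", 1))
        else:
            if pos is None:
                raise ValueError("data line before any fixedStep header")
            scores[pos] = float(line)
            pos += step
    return scores


def flank_scores(scores, site, flank):
    """Scores for site-flank .. site+flank (1-based, inclusive), clipped."""
    lo = max(1, site - flank)
    hi = min(len(scores) - 1, site + flank)
    return scores[lo:hi + 1]


# Tiny made-up example: two blocks with a gap at positions 5-7.
demo = [
    "fixedStep chrom=chr1 start=3 step=1",
    "0.1",
    "0.2",
    "fixedStep chrom=chr1 start=8 step=1",
    "0.9",
]
scores = load_fixedstep(demo, chrom_length=10)
window = flank_scores(scores, site=4, flank=1)  # positions 3, 4, 5
```

For a real chromosome, a plain Python list of floats would be very memory
hungry; a NumPy float32 array filled the same way would be the more
realistic container.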

> Alternatively, with a little work, you could probably also build a
> little index file and then use random access to get the data from the
> files.
>
> Finally, there are probably some tools in the UCSC browser tool chain
> that you could download to deal with conservation data fairly quickly.
>
> Sean
>
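
The index-file suggestion quoted above could look roughly like this. It is
a sketch only, assuming uncompressed fixedStep wiggle text with step=1 and
no blank lines inside blocks; build_index and fetch are hypothetical names,
not part of any existing tool.

```python
import bisect
import io


def build_index(handle):
    """Scan the file once and record, for each fixedStep block,
    (start position, number of values, byte offset of the first value)."""
    index = []
    start, count, offset = None, 0, 0
    while True:
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"fixedStep"):
            if start is not None:
                index.append((start, count, offset))
            text = line.decode("ascii")
            fields = dict(f.split("=", 1) for f in text.split()[1:])
            start, count = int(fields["start"]), 0
            offset = handle.tell()  # first data line of this block
        elif start is not None and line.strip():
            count += 1
    if start is not None:
        index.append((start, count, offset))
    return index


def fetch(handle, index, lo, hi):
    """Return {position: score} for lo..hi (1-based, inclusive).

    Positions that fall in gaps between blocks are simply absent.
    """
    out = {}
    starts = [block[0] for block in index]
    # Last block starting at or before lo (or the first block).
    first = max(bisect.bisect_right(starts, lo) - 1, 0)
    for start, count, offset in index[first:]:
        if start > hi:
            break
        end = start + count - 1
        if end < lo:
            continue
        handle.seek(offset)  # jump straight to the block's data
        for pos in range(start, min(end, hi) + 1):
            value = handle.readline()
            if pos >= lo:
                out[pos] = float(value.decode("ascii"))
    return out


# Made-up data standing in for a file opened with open(path, "rb").
demo = io.BytesIO(
    b"fixedStep chrom=chr1 start=3 step=1\n0.1\n0.2\n"
    b"fixedStep chrom=chr1 start=8 step=1\n0.9\n0.8\n"
)
index = build_index(demo)
hits = fetch(demo, index, 4, 8)
```

The seek only jumps to the start of a block, and lines inside the block are
still read one by one. Rewriting the values as fixed-width binary records
would allow a direct seek to any position, at the cost of a conversion step.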