[BioC] phastCon-scores

Sean Davis seandavi at gmail.com
Tue Dec 15 23:21:48 CET 2009


On Tue, Dec 15, 2009 at 5:01 PM, Johannes Waage <johannes.waage at bric.dk> wrote:
> Hi all,
>
> I have a small but important challenge set before me that I've been unable
> to solve. I need to aggregate all phastCons scores for 75-100 nt around all
> *mus* exon splice sites. I've tried different approaches, such as downloading
> the entire multiz30way phastCons dataset from UCSC (too big to work with
> smoothly), downloading via intersect with the UCSC Table Browser and Galaxy
> (limits me to 10 million data points, unfortunately), and fetching data
> through rtracklayer (too slow). Can anyone point me towards an elegant and
> fast way to fetch data points for many genomic intervals? With around 22k
> genes, with an average exon count of 8 times 100 nt, it seems I need to be
> able to fetch around 20m data points.
>
> I need to use the data as background in comparison to select upregulated
> exons in an RNA-seq splice study.

Could you do this chromosome-by-chromosome by loading the per-base
data one chromosome at a time from the files into an R vector and then
using normal vector subsetting to get the regions of interest?
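
A minimal sketch of that idea, written in Python for illustration (the same
logic carries over to an R numeric vector and ordinary subsetting). It assumes
the per-chromosome files are in UCSC fixedStep wiggle format, i.e. a
"fixedStep chrom=... start=... step=..." header followed by one score per
line. The function names load_fixedstep and window are hypothetical.

```python
import math

def load_fixedstep(lines, chrom_len):
    """Fill a per-base score vector (index 0 = base 1) from fixedStep lines.

    Positions without a score stay NaN. Assumes all lines belong to one
    chromosome, so the chrom= field is not checked.
    """
    scores = [math.nan] * chrom_len
    pos = None
    step = 1
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            pos = int(fields["start"]) - 1      # wiggle starts are 1-based
            step = int(fields.get("step", 1))
        elif pos is not None:
            if 0 <= pos < chrom_len:
                scores[pos] = float(line)
            pos += step
    return scores

def window(scores, site, flank):
    """Scores for `flank` bases on each side of a 1-based splice-site position."""
    lo = max(site - 1 - flank, 0)
    hi = min(site + flank, len(scores))
    return scores[lo:hi]
```

With one chromosome loaded, each splice site becomes a single slice, e.g.
window(scores, site, 50) for 50 nt on either side, so memory use is bounded
by the longest chromosome rather than the whole genome.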

Alternatively, with a little work, you could probably also build a
little index file and then use random access to get the data from the
files.
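
One way the index idea might look, again as a Python sketch under the same
fixedStep assumption, with step=1 and one file per chromosome. The index
records where each fixedStep block begins in the file, so a lookup seeks
straight to the right block instead of reading from the top. The names
build_index and fetch are hypothetical, and the sketch only handles regions
that fall inside a single block.

```python
import bisect

def build_index(path):
    """Return (start, byte_offset) pairs, one per fixedStep block.

    byte_offset points at the first score line after the block header.
    """
    index = []
    with open(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b"fixedStep"):
                fields = dict(f.split(b"=", 1) for f in line.split()[1:])
                index.append((int(fields[b"start"]), fh.tell()))
    return index

def fetch(path, index, start, end):
    """Scores for the 1-based inclusive region [start, end].

    Assumes step=1 and that the region lies within one block.
    """
    starts = [s for s, _ in index]
    i = bisect.bisect_right(starts, start) - 1
    if i < 0:
        return []                       # region precedes the first block
    block_start, offset = index[i]
    out = []
    with open(path, "rb") as fh:
        fh.seek(offset)                 # jump directly to the block
        pos = block_start
        for line in fh:
            if line.startswith(b"fixedStep") or pos > end:
                break
            if line.strip():
                if pos >= start:
                    out.append(float(line.decode()))
                pos += 1
    return out
```

The index could be saved to disk once per file and reused. Within a block the
scan is still linear because score lines vary in width; re-encoding the
scores as fixed-width binary records would make the offset of any base
computable directly, at the cost of a one-time conversion.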

Finally, there are probably some tools in the UCSC browser tool chain
that you could download to deal with conservation data fairly quickly.

Sean


