[BioC] table for GenomicRanges

Hervé Pagès hpages at fhcrc.org
Wed Dec 5 21:55:45 CET 2012


Hi,

As Tim pointed out, there is the question of what the output should
be.

Proposal 1: Output is a "table" object.

tableGenomicRanges1 <- function(x)
{
     x_levels <- sort(unique(x))
     y <- IRanges:::matchIntegerQuads(as.factor(seqnames(x)),
                                      as.factor(strand(x)),
                                      start(x), width(x),
                                      as.factor(seqnames(x_levels)),
                                      as.factor(strand(x_levels)),
                                      start(x_levels), width(x_levels))
     ans_names <- paste(seqnames(x_levels),
                        "[", start(x_levels), ",", end(x_levels), "]",
                        strand(x_levels), sep="")
     ans <- table(factor(y, levels=seq_along(x_levels)))
     names(ans) <- ans_names
     ans
}

 > tableGenomicRanges1(gr)
    chr1[6,10]+     chr1[1,5]-    chr1[1,10]-     chr1[2,6]-     chr1[3,7]-
            303            333            345            337            345
chr1[101,110]- chr1[102,111]- chr1[103,112]-    chr1[5,10]*    chr2[2,10]+
            327            308            343            308            338
    chr2[3,10]+     chr2[4,8]-     chr2[5,9]-    chr2[6,10]- chr2[104,113]-
            324            344            350            355            342
chr2[105,114]- chr2[106,115]- chr2[107,116]-    chr2[4,10]*    chr3[7,10]+
            354            338            318            336            351
    chr3[8,10]+    chr3[7,11]-    chr3[8,12]-    chr3[9,10]-    chr3[9,13]-
            355            360            357            324            325
   chr3[10,10]-   chr3[10,14]- chr3[108,117]- chr3[109,118]- chr3[110,119]-
            346            307            277            320            330

but this output is not very convenient...

Proposal 2: Output is a GenomicRanges object containing the
sorted "levels". The counts are stored as a metadata col.

tableGenomicRanges2 <- function(x)
{
     ans <- sort(unique(x))
     names(ans) <- mcols(ans) <- NULL
     y <- IRanges:::matchIntegerQuads(as.factor(seqnames(x)),
                                      as.factor(strand(x)),
                                      start(x), width(x),
                                      as.factor(seqnames(x_levels)),
                                      as.factor(strand(x_levels)),
                                      start(x_levels), width(x_levels))
     mcols(ans)$count <- tabulate(y, nbins=length(x_levels))
     ans
}

 > tableGenomicRanges2(gr)
GRanges with 30 ranges and 1 metadata column:
        seqnames     ranges strand   |     count
           <Rle>  <IRanges>  <Rle>   | <integer>
    [1]     chr1 [  6,  10]      +   |       303
    [2]     chr1 [  1,   5]      -   |       333
    [3]     chr1 [  1,  10]      -   |       345
    [4]     chr1 [  2,   6]      -   |       337
    [5]     chr1 [  3,   7]      -   |       345
    [6]     chr1 [101, 110]      -   |       327
    [7]     chr1 [102, 111]      -   |       308
    [8]     chr1 [103, 112]      -   |       343
    [9]     chr1 [  5,  10]      *   |       308
    ...      ...        ...    ... ...       ...
   [22]     chr3 [  7,  11]      -   |       360
   [23]     chr3 [  8,  12]      -   |       357
   [24]     chr3 [  9,  10]      -   |       324
   [25]     chr3 [  9,  13]      -   |       325
   [26]     chr3 [ 10,  10]      -   |       346
   [27]     chr3 [ 10,  14]      -   |       307
   [28]     chr3 [108, 117]      -   |       277
   [29]     chr3 [109, 118]      -   |       320
   [30]     chr3 [110, 119]      -   |       330
   ---
   seqlengths:
    chr1 chr2 chr3
    1000 2000 1500

Since the "table" method for GenomicRanges objects should really return
a "table" object it should do tableGenomicRanges1().
So what do we do for tableGenomicRanges2? Should we provide this
functionality thru a new generic function e.g. count()?
Note that tableGenomicRanges2() not only produces more convenient
output, it's also 4x faster or more than tableGenomicRanges1() on
big objects.

Thanks,
H.


On 12/05/2012 11:18 AM, Michael Lawrence wrote:
> I've often thought that this would be a really useful feature, that is,
> counting the combination of seqname, start, end and strand. This would
> follow the convention of the duplicated() method. All someone (Herve) needs
> to do is make an IRanges:::tableIntegerQuads. In the past, I have hacked
> this together by tabulating a string representation. Given the way R
> internalizes strings, that's not an ideal solution.
>
> Michael
>
> On Tue, Dec 4, 2012 at 4:25 PM, Vedran Franke <vfranke at bioinfo.hr> wrote:
>
>> Dear All,
>>
>> Is there an equivalent of the table function for a GenomicRanges object?
>>
>> Best,
>>
>> Vedran
>>
>>
>> --
>> Vedran Franke
>>
>> Bioinformatics Group,
>> Department of Molecular Biology,
>> Faculty of Science, Zagreb
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list