[BioC] CIGAR-aware coverage

Hervé Pagès hpages at fhcrc.org
Fri Feb 28 20:40:59 CET 2014


On 02/28/2014 10:58 AM, Michael Lawrence wrote:
>
>
>
> On Fri, Feb 28, 2014 at 9:47 AM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi Chris,
>
>
>     On 02/27/2014 11:00 PM, chris warth wrote:
>
>             GenomicAlignments is mostly a reorganization of pre-existing
>             functionality.
>             You should be able to call coverage() directly on your
>             GAlignments, or on
>             the BamFile itself, with full support for cigar strings.
>             Also, it is
>             sufficient to call readGAlignments(), i.e., no need for
>             "FromBam".
>
>
>         Thanks for this, I tried calling coverage() on GAlignments
>         before and
>         it wouldn't work for me, but looking closer I now understand the
>         error.
>
>            > class(myalignments)
>            [1] "GAlignments"
>            attr(,"package")
>            [1] "GenomicRanges"
>
>            > coverage(myalignments)
>            Error in coverage(grglist(x, drop.D.ranges = drop.D.ranges),
>         shift = shift,  :
>             error in evaluating the argument 'x' in selecting a method for
>         function  'coverage': Error in validObject(.Object) :
>             invalid class "GRangesList" object: 'mcols(x)' cannot have
>         columns
>         named  "seqnames",  "ranges", "strand", "start", "end", "width",  or
>         "element"
>
>
>         Eliminating the metadata (what = c("pos", "qwidth", "strand"))
>         in the
>         alignments allows coverage() to work now.
>
>         Why in the world does coverage() care if the alignments have
>         strand metadata?
>
>
>     It's not coverage() that cares. It's the internal coercion to
>     GRangesList that happens internally before the coverage is actually
>     computed. This internal coercion tries to propagate the metadata
>     columns, but, for reasons I never really understood, GRanges and
>     GRangesList objects are not allowed to hold a metadata column named
>     "strand".
>
>
> I think this is because there is otherwise ambiguity when coercing to a
> data.frame and also comes up now in the coercion of GRanges to an
> environment. We could just admit warnings in those cases though and
> rename with make.unique().

A warning would be good. I'm not a big fan of automatic renaming though.
Whatever we choose, we want as.data.frame() to handle the metadata cols
consistently. Right now it doesn't:

With a GAlignments:

   > gal
   GAlignments with 3 alignments and 2 metadata columns:
         seqnames strand       cigar    qwidth     start       end     width
            <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
     [1]     seq1      +         36M        36         1        36        36
     [2]     seq1      +         35M        35         3        37        35
     [3]     seq1      +         35M        35         5        39        35
             njunc |      flag   strand
         <integer> | <integer> <factor>
     [1]         0 |        73        +
     [2]         0 |        73        +
     [3]         0 |       137        +
     ---
     seqlengths:
      seq1 seq2
      1575 1584

   > as.data.frame(gal)
     seqnames strand cigar qwidth start end width njunc flag strand
   1     seq1      +   36M     36     1  36    36     0   73      +
   2     seq1      +   35M     35     3  37    35     0   73      +
   3     seq1      +   35M     35     5  39    35     0  137      +

With a GRanges (I artificially created an invalid object):

   > gr
   GRanges with 4 ranges and 1 metadata column:
         seqnames    ranges strand |      strand
            <Rle> <IRanges>  <Rle> | <character>
     [1]     seq1   [1, 36]      + |           -
     [2]     seq1   [3, 37]      + |           -
     [3]     seq1   [5, 39]      + |           -
     [4]     seq1   [6, 41]      + |           -
     ---
     seqlengths:
      seq1 seq2
      1575 1584

   > as.data.frame(gr)
     seqnames start end width strand strand.1
   1     seq1     1  36    36      +        -
   2     seq1     3  37    35      +        -
   3     seq1     5  39    35      +        -
   4     seq1     6  41    36      +        -

H.

>
>     The fix is easy: this internal coercion to GRangesList should
>     not propagate the metadata columns (they're not needed for computing the
>     coverage and they slow down the coercion). I'll fix this today.
>
>     Thanks for the report.
>     H.
>
>
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>         Search the archives:
>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list