[BioC] GappedAlignmentPairs requests

Hervé Pagès hpages at fhcrc.org
Wed Jul 11 08:57:20 CEST 2012


Hi Michael,

On 07/10/2012 09:01 PM, Michael Lawrence wrote:
> Looks great. Will it land in GenomicRanges? Also, a function for getting
> the inter-read gap would be nice, even if trivial.

Should not be hard. It will have to use 0-width ranges in the output
when the first/last ends are not in a upstream/downstream configuration
though. BTW, is "inter-read gap" the commonly used term for that gap?
What would be a good name for this function?

I've just added introns() for GappedAlignments and GappedAlignmentPairs
to GenomicRanges 1.9.31.

For my otlen() proposal (observed template length), I want to do some
comparison with what Tom is putting in the TLEN field of the BAM files
produced by gsnap. I have some big gsnap-generated BAM files with
incredibly complicated cigars (lots of gaps and soft clipping), they'll
be perfect for testing. If everything looks good, I'll add otlen()
to GenomicRanges (will be renamed tlen()).

H.

>
> Thanks,
> Michael
>
> On Tue, Jul 10, 2012 at 4:22 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi Michael,
>
>
>     On 07/10/2012 03:45 AM, Michael Lawrence wrote:
>
>
>
>         On Mon, Jul 9, 2012 at 11:47 PM, Hervé Pagès <hpages at fhcrc.org
>         <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
>
>              Hi Michael,
>
>
>              On 06/29/2012 05:06 AM, Michael Lawrence wrote:
>
>                  Hi (Herve),
>
>                  I have been playing around with GappedAlignmentPairs
>         and found
>                  it pretty
>                  useful. Some things that would be nice:
>
>                  1) A method for getting the observed fragment length
>         (would need
>                  to be NA
>                  when pairs are on different chromosomes). This should
>         at least
>                  have an
>                  option (if not always) to exclude the N cigar runs from the
>                  calculation. Or
>                  it could just take it from the TLEN (isize) column in
>         the BAM
>                  file (which
>                  at least in the case of gsnap excludes the Ns).
>
>
>               From the SAM Spec:
>
>              TLEN: signed observed Template LENgth. If all segments are
>         mapped to
>              the same reference, the unsigned observed template length
>         equals the
>              number of bases from the leftmost mapped base to the
>         rightmost mapped
>              base. [...] It is set as 0 for single-segment template or
>         when the
>              information is unavailable.
>
>              5 notes:
>
>              (1) It seems that most of the time this information is not
>         available
>                   in the BAM files produced by the aligners (at least
>         not in those
>                   I've seen so far).
>
>              (2) I wonder if the above definition really makes sense for a
>                   paired-end read with the 2 ends aligned to the *same*
>         strand.
>                   It's actually not clear to me how to interpret such a
>         situation
>                   and why the aligner is producing this kind of paired
>         alignments.
>                   Seems to reveal a template coming from a transcript
>         that mixes
>                   exons from both strands.
>
>
>
>         Yes, for non-concordant pairs, like ends that do not map to opposite
>         strands, or that map to different chromosomes or very far away
>         on the
>         same chromosome, this would not be easy to calculate.
>         Presumably, there
>         is some sort of genomic rearrangement or trans-splicing, and the
>         precise
>         break-point would need to be determined.
>
>              (3) Taking "the number of bases from the leftmost mapped base
>                   to the rightmost mapped base" is ok most of the times
>         but is
>                   questionable in some situations e.g. when the first
>         segment
>                   is not mapped upstream of the last segment. I'd rather set
>                   TLEN to NA in that case too, like when the 2 segments
>         are not
>                   mapped to the same reference, or when they are mapped
>         to the
>                   same strand.
>
>
>         This makes sense to me.
>
>              (4) The above is just a definition so it is what it is (and
>         is not
>                   inherently correct or incorrect) but it seems to me that
>                   excluding the N gaps would be closer to the intuitive
>         notion
>                   of "observed template length".
>
>              (5) The above definition ignores soft clipping.
>
>
>         Right. The definition is not that helpful, and Tom has already
>         decided
>         to violate it with gsnap, which counts the soft-clipped regions and
>         discounts the N regions when populating the TLEN field. So we're
>         currently happy just taking TLEN from the BAM file.
>
>              So here is a proposal that ignores the N gaps but takes
>         into account
>              soft clipping. 'galp' is GappedAlignmentPairs object. Note
>         that we
>              don't need to deal with the situation where the 2 ends are not
>              mapped to the same reference, or are mapped to the same
>         strand, because
>              those pairs are discarded (with a warning) when the
>         GappedAlignmentPairs
>              object is made (usually with readGappedAlignmentPairs()):
>
>              otlen <- function(galp)
>              {
>                 lgal <- left(galp)
>                 rgal <- right(galp)
>                 lend <- end(lgal)
>                 rstart <- start(rgal)
>                 ans <- qwidth(lgal) + qwidth(rgal) + rstart - lend - 1L
>
>                 ## Set to NA when the first and last ends are not in an
>                 ## upstream/downstream configuration:
>                 not_upstream_downstream <- rstart <= lend
>                 ans[not_upstream_downstream] <- NA_integer_
>
>                 ans
>              }
>
>              If we want to be a little less conservative about the first and
>              last ends being in an upstream/downstream configuration, we
>              can replace the computation of 'not_upstream_downstream' with:
>
>                 lstart <- start(lgal)
>                 rend <- end(rgal)
>                 not_upstream_downstream <- rstart < lstart | rend < lend
>
>
>
>                  2) A method called "introns" or something that returns
>         the N
>                  regions as a
>                  GRangesList, as a complement to grglist().
>
>
>              The special gap between the 2 ends is a little bit problematic.
>              How should we deal with it given that (1) it has a different
>              meaning than the N gaps (i.e. it does not in general
>         correspond to an
>              intron, only by chance), and (2) it's not always here (e.g.
>         when
>              the first and last ends are not in an upstream/downstream
>              configuration)? Of course (2) could always be handled by
>         removing
>              the problematic alignment pairs.
>
>
>         Well, my request was to discard the inter-read gap, so that I
>         would just
>         have the splices as called by the aligner.
>
>
>     I see. Then this can be done with something like this:
>
>     ## First we need an optimized mendoapply(c, x, y) for CompressedList
>     ## objects (probably worth adding to IRanges, could also be called
>     ## elementCombine() and take an arbitrary number of objects).
>     pcombine <- function(x, y)
>     {
>        if (!is(x, "CompressedList") ||
>            class(x) != class(y) ||
>            length(x) != length(y))
>          stop("'x' and 'y' must be CompressedList objects ",
>               "of the same class and length")
>        tmp <- c(x, y)[GenomicRanges:::.__makePickupIndex(length(x))]
>        ans <- GenomicRanges:::.shrinkByHalf(__tmp)
>        elementMetadata(ans) <- NULL
>        ans
>     }
>
>     ## Then the introns() function itself for GappedAlignmentPairs.
>     introns <- function(x)
>     {
>        if (!is(x, "GappedAlignmentPairs"))
>          stop("'x' must be a GappedAlignmentPairs object")
>
>        ## Extract introns (i.e. N gaps) from the first ends.
>        Fgal <- first(x)
>        Fgrl <- grglist(Fgal, order.as.in.query=TRUE)
>        Fintrons <- psetdiff(granges(Fgal), Fgrl)
>
>        ## Extract introns (i.e. N gaps) from the last ends.
>        Lgal <- last(x, invert.strand=TRUE)
>        Lgrl <- grglist(Lgal, order.as.in.query=TRUE)
>        Lintrons <- psetdiff(granges(Lgal), Lgrl)
>
>        ## Combine the introns from first and last ends into a
>        ## single GRangesList object.
>        ans <- pcombine(Fintrons, Lintrons)
>        names(ans) <- names(x)
>        ans
>     }
>
>     Try it:
>
>        my_introns <- introns(galp)
>        stopifnot(identical(__elementLengths(my_introns),
>     ngap(first(galp)) + ngap(last(galp))))
>
>
>         It would be nice to have
>         another function that would return the inter-read gap. I
>         actually have
>         functions that do all these things, but they're not very elegant or
>         robust to all of these situations.
>
>
>                  Such a method would also be nice
>                  on GappedAlignments.
>
>                   ^^^^^^^^^^^^^^^^
>
>
>
>              This one is easier ('gal' being a GappedAlignments object):
>
>                                              ^^^^^^^^^^^^^^^^
>
>
>                 psetdiff(granges(gal), grglist(gal))
>
>              Also I think you actually sent us a "gaps" method for
>         GRangesList
>              recently that we are about to add to the GenomicRanges package.
>              Once we have it, the user will be able to do:
>
>                 gaps(grglist(gal))
>
>              as an equivalent way to do the above.
>
>
>         Right, but as I said, we want to ignore the inter-read gap,
>         which takes
>         a bit more work. Right now, I do the above and then setdiff off the
>         inter-read gap.
>
>
>     The above is for a GappedAlignments object so there is no inter-read
>     gap.
>
>     HTH,
>     H.
>
>
>              Thanks,
>
>              H.
>
>
>                  Thanks,
>                  Michael
>
>                           [[alternative HTML version deleted]]
>
>                  ___________________________________________________
>                  Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         <mailto:Bioconductor at r-__project.org
>         <mailto:Bioconductor at r-project.org>>
>         https://stat.ethz.ch/mailman/____listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>                  <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                  Search the archives:
>         http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>
>              --
>              Hervé Pagès
>
>              Program in Computational Biology
>              Division of Public Health Sciences
>              Fred Hutchinson Cancer Research Center
>              1100 Fairview Ave. N, M1-B514
>              P.O. Box 19024
>              Seattle, WA 98109-1024
>
>              E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>              Phone:  (206) 667-5791
>              Fax:    (206) 667-1319
>
>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone:  (206) 667-5791
>     Fax:    (206) 667-1319
>
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list