[BioC] GappedAlignmentPairs requests

Wed Jul 11 01:22:46 CEST 2012

Hi Michael,

On 07/10/2012 03:45 AM, Michael Lawrence wrote:
>
>
> On Mon, Jul 9, 2012 at 11:47 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi Michael,
>
>
>     On 06/29/2012 05:06 AM, Michael Lawrence wrote:
>
>         Hi (Herve),
>
>         I have been playing around with GappedAlignmentPairs and found
>         it pretty
>         useful. Some things that would be nice:
>
>         1) A method for getting the observed fragment length (would need
>         to be NA
>         when pairs are on different chromosomes). This should at least
>         have an
>         option (if not always) to exclude the N cigar runs from the
>         calculation. Or
>         it could just take it from the TLEN (isize) column in the BAM
>         file (which
>         at least in the case of gsnap excludes the Ns).
>
>
>      From the SAM Spec:
>
>     TLEN: signed observed Template LENgth. If all segments are mapped to
>     the same reference, the unsigned observed template length equals the
>     number of bases from the leftmost mapped base to the rightmost mapped
>     base. [...] It is set as 0 for single-segment template or when the
>     information is unavailable.
>
>     5 notes:
>
>     (1) It seems that most of the time this information is not available
>          in the BAM files produced by the aligners (at least not in those
>          I've seen so far).
>
>     (2) I wonder if the above definition really makes sense for a
>          paired-end read with the 2 ends aligned to the *same* strand.
>          It's actually not clear to me how to interpret such a situation
>          and why the aligner is producing this kind of paired alignments.
>          Seems to reveal a template coming from a transcript that mixes
>          exons from both strands.
>
>
>
> Yes, for non-concordant pairs, like ends that do not map to opposite
> strands, or that map to different chromosomes or very far away on the
> same chromosome, this would not be easy to calculate. Presumably, there
> is some sort of genomic rearrangement or trans-splicing, and the precise
> break-point would need to be determined.
>
>     (3) Taking "the number of bases from the leftmost mapped base
>          to the rightmost mapped base" is ok most of the times but is
>          questionable in some situations e.g. when the first segment
>          is not mapped upstream of the last segment. I'd rather set
>          TLEN to NA in that case too, like when the 2 segments are not
>          mapped to the same reference, or when they are mapped to the
>          same strand.
>
>
> This makes sense to me.
>
>     (4) The above is just a definition so it is what it is (and is not
>          inherently correct or incorrect) but it seems to me that
>          excluding the N gaps would be closer to the intuitive notion
>          of "observed template length".
>
>     (5) The above definition ignores soft clipping.
>
>
> Right. The definition is not that helpful, and Tom has already decided
> to violate it with gsnap, which counts the soft-clipped regions and
> discounts the N regions when populating the TLEN field. So we're
> currently happy just taking TLEN from the BAM file.
>
>     So here is a proposal that ignores the N gaps but takes into account
>     soft clipping. 'galp' is GappedAlignmentPairs object. Note that we
>     don't need to deal with the situation where the 2 ends are not
>     mapped to the same reference, or are mapped to the same strand, because
>     those pairs are discarded (with a warning) when the GappedAlignmentPairs
>     object is made (usually with readGappedAlignmentPairs()):
>
>     otlen <- function(galp)
>     {
>        lgal <- left(galp)
>        rgal <- right(galp)
>        lend <- end(lgal)
>        rstart <- start(rgal)
>        ans <- qwidth(lgal) + qwidth(rgal) + rstart - lend - 1L
>
>        ## Set to NA when the first and last ends are not in an
>        ## upstream/downstream configuration:
>        not_upstream_downstream <- rstart <= lend
>        ans[not_upstream_downstream] <- NA_integer_
>
>        ans
>     }
>
>     If we want to be a little less conservative about the first and
>     last ends being in an upstream/downstream configuration, we
>     can replace the computation of 'not_upstream_downstream' with:
>
>        lstart <- start(lgal)
>        rend <- end(rgal)
>        not_upstream_downstream <- rstart < lstart | rend < lend
>
>
>
>         2) A method called "introns" or something that returns the N
>         regions as a
>         GRangesList, as a complement to grglist().
>
>
>     The special gap between the 2 ends is a little bit problematic.
>     How should we deal with it given that (1) it has a different
>     meaning than the N gaps (i.e. it does not in general correspond to an
>     intron, only by chance), and (2) it's not always here (e.g. when
>     the first and last ends are not in an upstream/downstream
>     configuration)? Of course (2) could always be handled by removing
>     the problematic alignment pairs.
>
>
> Well, my request was to discard the inter-read gap, so that I would just
> have the splices as called by the aligner.

I see. Then this can be done with something like this:

## First we need an optimized mendoapply(c, x, y) for CompressedList
## objects (probably worth adding to IRanges, could also be called
## elementCombine() and take an arbitrary number of objects).
pcombine <- function(x, y)
{
   if (!is(x, "CompressedList") ||
       class(x) != class(y) ||
       length(x) != length(y))
     stop("'x' and 'y' must be CompressedList objects ",
          "of the same class and length")
   tmp <- c(x, y)[GenomicRanges:::.makePickupIndex(length(x))]
   ans <- GenomicRanges:::.shrinkByHalf(tmp)
   elementMetadata(ans) <- NULL
   ans
}

## Then the introns() function itself for GappedAlignmentPairs.
introns <- function(x)
{
   if (!is(x, "GappedAlignmentPairs"))
     stop("'x' must be a GappedAlignmentPairs object")

   ## Extract introns (i.e. N gaps) from the first ends.
   Fgal <- first(x)
   Fgrl <- grglist(Fgal, order.as.in.query=TRUE)
   Fintrons <- psetdiff(granges(Fgal), Fgrl)

   ## Extract introns (i.e. N gaps) from the last ends.
   Lgal <- last(x, invert.strand=TRUE)
   Lgrl <- grglist(Lgal, order.as.in.query=TRUE)
   Lintrons <- psetdiff(granges(Lgal), Lgrl)

   ## Combine the introns from first and last ends into a
   ## single GRangesList object.
   ans <- pcombine(Fintrons, Lintrons)
   names(ans) <- names(x)
   ans
}

Try it:

   my_introns <- introns(galp)
   stopifnot(identical(elementLengths(my_introns), ngap(first(galp)) + 
ngap(last(galp))))

> It would be nice to have
> another function that would return the inter-read gap. I actually have
> functions that do all these things, but they're not very elegant or
> robust to all of these situations.
>
>
>         Such a method would also be nice
>         on GappedAlignments.
              ^^^^^^^^^^^^^^^^
>
>
>     This one is easier ('gal' being a GappedAlignments object):
                                         ^^^^^^^^^^^^^^^^
>
>        psetdiff(granges(gal), grglist(gal))
>
>     Also I think you actually sent us a "gaps" method for GRangesList
>     recently that we are about to add to the GenomicRanges package.
>     Once we have it, the user will be able to do:
>
>        gaps(grglist(gal))
>
>     as an equivalent way to do the above.
>
>
> Right, but as I said, we want to ignore the inter-read gap, which takes
> a bit more work. Right now, I do the above and then setdiff off the
> inter-read gap.

The above is for a GappedAlignments object so there is no inter-read
gap.

HTH,
H.

>
>     Thanks,
>
>     H.
>
>
>         Thanks,
>         Michael
>
>                  [[alternative HTML version deleted]]
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>         Search the archives:
>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone:  (206) 667-5791
>     Fax:    (206) 667-1319
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319