[BioC] Fwd: BioStrings PairwiseAlignment deletion() insertion() function

Hervé Pagès hpages at fhcrc.org
Sun Sep 29 06:44:46 CEST 2013


On 09/28/2013 02:35 PM, Marcin Imielinski wrote:
> Thanks, but I'm still a bit confused.  str1 (pattern) has width 101 and
> str2 (subject) has width 404.   If deletion() specifies pattern gaps as
> coordinates on str1 then why is the 9th deletion range 94-133?
>
> I suppose that the "end" coordinate of these indel ranges are irrelevant
> - you are just specifying a gap of width 40 at pattern position 94, and
> the 133 "endpoint" is just an artifact of using IRanges to specify these
> gaps.

Exactly.

>
> It would seem to be more intuitive to specify subject / pattern gaps as
> ranges on the alignment - just my two cents.

I agree but these were decisions taken a long time ago and
unfortunately we're kind of stuck with them. On way we could try to
improve things is by adding an extra argument, e.g. 'on.alignment'
(that would need to be FALSE by default) to deletion() and insertion(),
to let the user request the ranges to be reported with respect
to the alignments. Another way to go, which has my preference, would
be to use CIGARs to represent the indels in pairwise alignments. That's
what SAMtools does and we already have efficient/well-tested CIGAR
utilities in GenomicRanges (see ?`cigar-utils`, they've been
refactored in BioC devel) for converting cigars to ranges along
the unaligned patterns (query space), or the unaligned subject
(reference space), or the alignments ("pairwise space"). Those
utilities are used in several places internally e.g. when converting
a GAlignments object into a GRanges or GRangesList object or in
sequenceLayer(). There is an oportunity to reuse them for
PairwiseAlignments object too and to make those objects look more
like GAlignments objects by providing the CIGARs of the alignments
(via a cigar() getter like for Alignments objects). That would also
make PairwiseAlignments objects much more compact in memory.

>
> With that said, I'd like to thank you and the rest of the FHCRC team for
> these fantastic packages (Biostrings, IRanges, GenomicRanges, etc)!

You're welcome. Glad you like them and thanks for your feedback!

H.

>
>
> On Fri, Sep 27, 2013 at 8:22 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi Marcin,
>
>
>     On 09/27/2013 07:48 AM, Marcin Imielinski wrote:
>
>         Sorry, forgot to cc you.  Thanks in advance!
>
>         ---------- Forwarded message ----------
>         From: *Marcin Imielinski* <marcin at broad.mit.edu
>         <mailto:marcin at broad.mit.edu>
>         <mailto:marcin at broad.mit.edu <mailto:marcin at broad.mit.edu>>>
>         Date: Fri, Sep 27, 2013 at 10:40 AM
>         Subject: BioStrings PairwiseAlignment deletion() insertion()
>         function
>         To: bioconductor at r-project.org
>         <mailto:bioconductor at r-project.org>
>         <mailto:bioconductor at r-__project.org
>         <mailto:bioconductor at r-project.org>>
>
>
>         Hi -
>
>         I'm confused about the output of deletion(pa) and insertion(pa)
>         functions for pa = pairwiseAlignment().   My understanding is
>         that they
>         should output IRanges corresponding to gaps in the pattern
>         (deletion())
>         and gaps in the subject (insertion()) in terms of alignment
>         coordinates.
>
>         However, it appears that the outputted ranges can overlap.  For
>         example,
>         the alignment (below) of a 101 letter pattern and 404 letter
>         subject.
>         The deletion ranges overlap.   What sequence positions do these
>         ranges
>         refer to (pattern, subject, or alignment)?
>
>         Is this is a bug or am I misinterpreting this output?
>
>
>     It's a questionable design choice but the ranges describing the
>     deletions are reported with respect to the "original" (aka unaligned)
>     pattern, not to the aligned pattern. For example the 1st range you
>     see in 'deletion(pa)' means there is a deletion of 3 letters (when
>     going from subject to pattern) and that this deletion starts at position
>     4 in the original pattern. With such conventions, the *ranges*
>     describing the deletions can overlap but the deletions don't, which
>     I must admit is very confusing.
>
>     As you noticed, those ranges can be shifted to make them refer to the
>     aligned pattern:
>
>        > offset <- c(0L, cumsum(head(width(deletion(pa)__[[1]]), n=-1)))
>        > shift(deletion(pa)[[1]], offset)
>
>        IRanges of length 9
>            start end width
>        [1]     4   6     3
>        [2]    21  39    19
>        [3]    49  76    28
>        [4]    84 128    45
>        [5]   135 172    38
>        [6]   183 186     4
>        [7]   195 204    10
>        [8]   212 215     4
>        [9]   245 284    40
>
>     Hope this helps,
>     H.
>
>
>
>         Thanks
>         Marcin
>
>         ##############################__##############################__##
>
>           > str[1]
>         [1]
>         "__TCCTTGCACATTGATAAGTTCACATCTGAA__ATTTGCATGACATAAACATACAGTTGAGAA__GGAGAGAACGTATGCCCTATGGTAAATATT__GACATTTTAAA"
>           > str[2]
>         [1]
>         "__CTGGGCTTTCGATGAAATAGTTCATTTATC__TGTGGGTAGATATTACTTACTGGTTGAGTT__AAACTGGGTTAAACATCAATTCTATTTCCA__TTTTTCATTTTTATAAATAGGTACTGAGAA__TCTTTGTTCATATAAATAGATGGATAGGAT__TAGCCACTTCTTTGAATTTCTTTTTCAAGT__TTCATGCCAAGATTCACATCATAACACATG__TAACTGCATGTCTGGATGGAGAACAGATGT__ACCTATGCAGCGGCAGGGACATCAACACTC__TCACTGATGAATTGGCCGAGGAATGAGGAA__TAGCACAAATCAGCTACGGAACATTGACAA__ACTGGGAGCTAAACTTTGCTTCATGCCTGT__GAGGCAGTATTTTGATGAGCGGTGGATGCC__CAGTGCTTCCTTGT"
>           >
>           > pa = pairwiseAlignment(str[1], str[2])
>           > deletion(pa)
>         IRangesList of length 1
>         [[1]]
>         IRanges of length 9
>               start end width
>         [1]     4   6     3
>         [2]    18  36    19
>         [3]    27  54    28
>         [4]    34  78    45
>         [5]    40  77    38
>         [6]    50  53     4
>         [7]    58  67    10
>         [8]    65  68     4
>         [9]    94 133    40
>
>           > insertion(pa)
>         IRangesList of length 1
>         [[1]]
>         IRanges of length 1
>               start end width
>         [1]   234 235     2
>
>           > nchar(pa)
>         [1] 292
>
>           > as.character(aligned(pattern(__pa)))
>         [1]
>         "TCC---TTGCACATTGATAA---------__----------GTTCACATC-----------__-----------------TGAAATT------__------------------------------__---------TGCATG---------------__-----------------------__ACATAAACAT----ACAGTTGA--------__--GAAGGAG----__AGAACGTATGCCCTATGGTAAATATTGAC-__------------------------------__---------ATTTTAAA"
>           > as.character(aligned(subject(__pa)))
>         [1]
>         "__TCCATTTTTCATTTTTATAAATAGGTACTG__AGAATCTTTGTTCATATAAATAGATGGATA__GGATTAGCCACTTCTTTGAATTTCTTTTTC__AAGTTTCATGCCAAGATTCACATCATAACA__CATGTAACTGCATGTCTGGATGGAGAACAG__ATGTACCTATGCAGCGGCAGGGACATCAAC__ACTCTCACTGATGAATTGGCCGAGGAATGA__GGAATAGCACAAATCAGCTACGG--__AACATTGACAAACTGGGAGCTAAACTTTGC__TTCATGCCTGTGAGGCAGTATTTTGAT"
>
>         ### here is what I would expect deletion(pa) to output ...
>         notice that
>         it resembles
>         ### the above deletion(pa) output with a shift corresponding to
>         cumsum(width)
>         ### is this a bug?
>
>           > as(which(strsplit(as.__character(aligned(subject(pa))__),
>         '')[[1]] ==
>         "-"), 'IRanges')
>         IRanges of length 9
>               start end width
>         [1]     4   6     3
>         [2]    21  39    19
>         [3]    49  76    28
>         [4]    84 128    45
>         [5]   135 172    38
>         [6]   183 186     4
>         [7]   195 204    10
>         [8]   212 215     4
>         [9]   245 284    40
>
>
>           > sessionInfo()
>         R version 3.0.0 (2013-04-03)
>         Platform: x86_64-unknown-linux-gnu (64-bit)
>
>         locale:
>         [1] C
>
>         attached base packages:
>         [1] parallel  stats     graphics  grDevices utils     datasets
>           methods
>         [8] base
>
>         other attached packages:
>         [1] Biostrings_2.28.0    IRanges_1.18.4       BiocGenerics_0.6.0
>         [4] BiocInstaller_1.10.3 multicore_0.1-7
>
>         loaded via a namespace (and not attached):
>         [1] stats4_3.0.0 tools_3.0.0
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list