[BioC] obtaining a complete global alignment via pairwiseAlignment

Thu Jun 20 07:17:16 CEST 2013

Thanks Hervé.  Adding such an option 'full.alignment' would be extremely 
useful to me, since it would let me get access to the complete global 
alignment with minimal string parsing.

On 6/19/13 11:33 AM, Hervé Pagès wrote:
> Hi guys,
>
> It was an early decision of the author of pairwiseAlignment() to not
> display the indels that occur at the ends of the aligned sequences.
> The consequence of this is that the alignment "looks" like it's a
> local alignment even though it's global.
>
> Using writePairwiseAlignments() to display the alignments disambiguates
> this.
>
> With your 1st example:
>
>    s1 <- DNAString("AAGGAA")
>    s2 <- DNAString("GG")
>    xx <- pairwiseAlignment(s1, s2, type="global")
>
> Then:
>
>  > xx
> Global PairwiseAlignmentsSingleSubject (1 of 1)
> pattern: [3] GG
> subject: [1] GG
> score: -32.03649
>
>  > writePairwiseAlignments(xx)
> ########################################
> # Program: Biostrings (version 2.29.7), a Bioconductor package
> # Rundate: Wed Jun 19 11:15:27 2013
> ########################################
> #=======================================
> #
> # Aligned_sequences: 2
> # 1: P1
> # 2: S1
> # Matrix: NA
> # Gap_penalty: 14.0
> # Extend_penalty: 4.0
> #
> # Length: 6
> # Identity:       2/6 (33.3%)
> # Similarity:    NA/6 (NA%)
> # Gaps:           4/6 (66.7%)
> # Score: -32.03649
> #
> #
> #=======================================
>
> P1                 1 AAGGAA      6
>                         ||
> S1                 1 --GG--      2
>
>
> #---------------------------------------
> #---------------------------------------
>
> With your 2nd example:
>
>    s1 <- DNAString("AGGGG")
>    s2 <- DNAString("GGGGA")
>    xx <- pairwiseAlignment(s1, s2, gapOpening=-1)
>
> Then:
>
>  > xx
> Global PairwiseAlignmentsSingleSubject (1 of 1)
> pattern: [2] GGGG
> subject: [1] GGGG
> score: -2.072976
>
>  > writePairwiseAlignments(xx)
> ########################################
> # Program: Biostrings (version 2.29.7), a Bioconductor package
> # Rundate: Wed Jun 19 11:19:40 2013
> ########################################
> #=======================================
> #
> # Aligned_sequences: 2
> # 1: P1
> # 2: S1
> # Matrix: NA
> # Gap_penalty: 5.0
> # Extend_penalty: 4.0
> #
> # Length: 6
> # Identity:       4/6 (66.7%)
> # Similarity:    NA/6 (NA%)
> # Gaps:           2/6 (33.3%)
> # Score: -2.072976
> #
> #
> #=======================================
>
> P1                 1 AGGGG-      5
>                        ||||
> S1                 1 -GGGGA      5
>
>
> #---------------------------------------
> #---------------------------------------
>
> It's sad though that 'aligned(pattern(xx))' and 'aligned(subject(xx))'
> also trim those indels that are at the ends because they are really
> part of the aligned sequences. However this behavior has been around for
> so long now that I'm not sure we want to touch it. What we could to
> is add an extra argument to the aligned() getter, say 'full.alignment',
> with default to FALSE, to let the user extract the whole thing.
>
> Would that be useful?
>
> Cheers,
> H.
>
>
> On 06/15/2013 10:19 AM, Martin Morgan wrote:
>> On 06/15/2013 09:18 AM, Robert K. Bradley wrote:
>>> Thanks Martin.  as.character (aligned (...)) does return a gapped
>>> string for one
>>> of the two aligned sequences, but even that command trims off unaligned
>>> characters.  For example:
>>> For example:
>>>  > s1 = DNAString ("AGGGG")
>>>  > s2 = DNAString ("GGGGA")
>>>  > xx = pairwiseAlignment (s1, s2, gapOpening = -1)
>>>  > as.character (xx)
>>> [1] "GGGG-"
>>> while the whole gapped string is "AGGGG-".
>>>
>>> I apologize in advance if I'm missing something obvious here.
>>
>> Not really doing much better on my end; I'll offer this up then step
>> aside until someone with more expertise chimes in. Maybe this is closer
>> to what you're after, padding aligned(xx) with and left or right
>> overhang of pattern
>>
>>      lraligned <- function(pattern, subject, ...) {
>>          xx <- pairwiseAlignment (pattern, subject, gapOpening = -1)
>>          aln <- aligned(xx)
>>          lpad <- substr(pattern, 1L, start(pattern(xx)) - 1L)
>>          rpad <- substr(pattern, end(pattern(xx)) + 1L, nchar(pattern))
>>          paste0(lpad, aln, rpad)
>>      }
>>
>>  > lraligned(DNAString ("AGGGG"), DNAString ("GGGGA"), type="global")
>> [1] "AGGGG-"
>>  > lraligned(DNAString ("GGGGA"), DNAString ("AGGGG"), type="global")
>> [1] "-GGGGA"
>>  > lraligned(DNAString("AAGGAA"), DNAString("GG"))
>> [1] "AAGGAA"
>>  > lraligned(DNAString("GG"), DNAString("AAGGAA"))
>> [1] "--GG--"
>>
>> Martin
>>
>>>
>>> Best,
>>> Rob
>>>
>>> On 6/15/13 6:45 AM, Martin Morgan wrote:
>>>> On 06/14/2013 07:57 PM, Robert K.Bradley [guest] wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I'm interested in using the pairwiseAlignment function in Biostrings
>>>>> to perform simple alignments of short sequences.  I noticed that even
>>>>> the global alignment mode returns an alignment with gapped ends
>>>>> trimmed off.  For example:
>>>>>
>>>>>> s1 = DNAString ("AAGGAA")
>>>>>> s2 = DNAString ("GG")
>>>>>> pairwiseAlignment (s1, s2, type = "global")
>>>>> Global PairwiseAlignmentsSingleSubject (1 of 1)
>>>>> pattern: [3] GG
>>>>> subject: [1] GG
>>>>> score: -32.03649
>>>>>
>>>>> Is it possible to obtain the complete alignment including the
>>>>> (possibly) gapped ends?  In this case, that would be:
>>>>> --GG--
>>>>> AAGGAA
>>>>> For my purposes, having the complete gapped alignment is important.
>>>>
>>>> Hi Robert -- Not an expert on pairwiseAlignment, but I think the return
>>>> value is a class that contains more information than it is showing
>>>>
>>>>      > xx = pairwiseAlignment(s2, s1, type="global")
>>>>      > class(xx)
>>>>      [1] "PairwiseAlignmentsSingleSubject"
>>>>      attr(,"package")
>>>>      [1] "Biostrings"
>>>>      > class?PairwiseAlignmentsSingleSubject
>>>>      > aligned(xx)
>>>>        A DNAStringSet instance of length 1
>>>>          width seq
>>>>      [1]     6 --GG--
>>>>
>>>> Martin
>>>>
>>>>>
>>>>> My question seems related to a previous post:
>>>>> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043577.html
>>>>>
>>>>> Thank you in advance.
>>>>>
>>>>> Best wishes,
>>>>> Rob
>>>>>
>>>>
>>>>
>>
>>
>