[BioC] obtaining a complete global alignment via pairwiseAlignment

Wed Jun 19 20:33:54 CEST 2013

Hi guys,

It was an early decision of the author of pairwiseAlignment() to not
display the indels that occur at the ends of the aligned sequences.
The consequence of this is that the alignment "looks" like it's a
local alignment even though it's global.

Using writePairwiseAlignments() to display the alignments disambiguates
this.

With your 1st example:

   s1 <- DNAString("AAGGAA")
   s2 <- DNAString("GG")
   xx <- pairwiseAlignment(s1, s2, type="global")

Then:

 > xx
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [3] GG
subject: [1] GG
score: -32.03649

 > writePairwiseAlignments(xx)
########################################
# Program: Biostrings (version 2.29.7), a Bioconductor package
# Rundate: Wed Jun 19 11:15:27 2013
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: P1
# 2: S1
# Matrix: NA
# Gap_penalty: 14.0
# Extend_penalty: 4.0
#
# Length: 6
# Identity:       2/6 (33.3%)
# Similarity:    NA/6 (NA%)
# Gaps:           4/6 (66.7%)
# Score: -32.03649
#
#
#=======================================

P1                 1 AAGGAA      6
                        ||
S1                 1 --GG--      2

#---------------------------------------
#---------------------------------------

With your 2nd example:

   s1 <- DNAString("AGGGG")
   s2 <- DNAString("GGGGA")
   xx <- pairwiseAlignment(s1, s2, gapOpening=-1)

Then:

 > xx
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [2] GGGG
subject: [1] GGGG
score: -2.072976

 > writePairwiseAlignments(xx)
########################################
# Program: Biostrings (version 2.29.7), a Bioconductor package
# Rundate: Wed Jun 19 11:19:40 2013
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: P1
# 2: S1
# Matrix: NA
# Gap_penalty: 5.0
# Extend_penalty: 4.0
#
# Length: 6
# Identity:       4/6 (66.7%)
# Similarity:    NA/6 (NA%)
# Gaps:           2/6 (33.3%)
# Score: -2.072976
#
#
#=======================================

P1                 1 AGGGG-      5
                       ||||
S1                 1 -GGGGA      5

#---------------------------------------
#---------------------------------------

It's sad though that 'aligned(pattern(xx))' and 'aligned(subject(xx))'
also trim those indels that are at the ends because they are really
part of the aligned sequences. However this behavior has been around for
so long now that I'm not sure we want to touch it. What we could to
is add an extra argument to the aligned() getter, say 'full.alignment',
with default to FALSE, to let the user extract the whole thing.

Would that be useful?

Cheers,
H.

On 06/15/2013 10:19 AM, Martin Morgan wrote:
> On 06/15/2013 09:18 AM, Robert K. Bradley wrote:
>> Thanks Martin.  as.character (aligned (...)) does return a gapped
>> string for one
>> of the two aligned sequences, but even that command trims off unaligned
>> characters.  For example:
>> For example:
>>  > s1 = DNAString ("AGGGG")
>>  > s2 = DNAString ("GGGGA")
>>  > xx = pairwiseAlignment (s1, s2, gapOpening = -1)
>>  > as.character (xx)
>> [1] "GGGG-"
>> while the whole gapped string is "AGGGG-".
>>
>> I apologize in advance if I'm missing something obvious here.
>
> Not really doing much better on my end; I'll offer this up then step
> aside until someone with more expertise chimes in. Maybe this is closer
> to what you're after, padding aligned(xx) with and left or right
> overhang of pattern
>
>      lraligned <- function(pattern, subject, ...) {
>          xx <- pairwiseAlignment (pattern, subject, gapOpening = -1)
>          aln <- aligned(xx)
>          lpad <- substr(pattern, 1L, start(pattern(xx)) - 1L)
>          rpad <- substr(pattern, end(pattern(xx)) + 1L, nchar(pattern))
>          paste0(lpad, aln, rpad)
>      }
>
>  > lraligned(DNAString ("AGGGG"), DNAString ("GGGGA"), type="global")
> [1] "AGGGG-"
>  > lraligned(DNAString ("GGGGA"), DNAString ("AGGGG"), type="global")
> [1] "-GGGGA"
>  > lraligned(DNAString("AAGGAA"), DNAString("GG"))
> [1] "AAGGAA"
>  > lraligned(DNAString("GG"), DNAString("AAGGAA"))
> [1] "--GG--"
>
> Martin
>
>>
>> Best,
>> Rob
>>
>> On 6/15/13 6:45 AM, Martin Morgan wrote:
>>> On 06/14/2013 07:57 PM, Robert K.Bradley [guest] wrote:
>>>>
>>>> Hello,
>>>>
>>>> I'm interested in using the pairwiseAlignment function in Biostrings
>>>> to perform simple alignments of short sequences.  I noticed that even
>>>> the global alignment mode returns an alignment with gapped ends
>>>> trimmed off.  For example:
>>>>
>>>>> s1 = DNAString ("AAGGAA")
>>>>> s2 = DNAString ("GG")
>>>>> pairwiseAlignment (s1, s2, type = "global")
>>>> Global PairwiseAlignmentsSingleSubject (1 of 1)
>>>> pattern: [3] GG
>>>> subject: [1] GG
>>>> score: -32.03649
>>>>
>>>> Is it possible to obtain the complete alignment including the
>>>> (possibly) gapped ends?  In this case, that would be:
>>>> --GG--
>>>> AAGGAA
>>>> For my purposes, having the complete gapped alignment is important.
>>>
>>> Hi Robert -- Not an expert on pairwiseAlignment, but I think the return
>>> value is a class that contains more information than it is showing
>>>
>>>      > xx = pairwiseAlignment(s2, s1, type="global")
>>>      > class(xx)
>>>      [1] "PairwiseAlignmentsSingleSubject"
>>>      attr(,"package")
>>>      [1] "Biostrings"
>>>      > class?PairwiseAlignmentsSingleSubject
>>>      > aligned(xx)
>>>        A DNAStringSet instance of length 1
>>>          width seq
>>>      [1]     6 --GG--
>>>
>>> Martin
>>>
>>>>
>>>> My question seems related to a previous post:
>>>> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043577.html
>>>>
>>>> Thank you in advance.
>>>>
>>>> Best wishes,
>>>> Rob
>>>>
>>>
>>>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319