[BioC] Help about 'pairwiseAlignment' in Biostrings package

Hervé Pagès hpages at fhcrc.org
Wed Mar 9 19:41:25 CET 2011


Hi LiGang,

Note that by default, pairwiseAlignment() performs "global" alignment,
not "local". However, the example you provided looks more like
a needle/haystack situation (short pattern that must be fully aligned
to a substring of a long subject), so it looks like the type of
alignment you are looking for is "global-local".

   > library(Biostrings)
   > set.seed(12345)
   > toquery <- DNAString(paste(sample(c("A","G","C","T"),200, 
rep=TRUE),collapse=""))
   > toinsert <- DNAString('GCGGTGCCCGGGGCATCTCCGCGGCGGAAC')
   > to.subject<-c(toquery, toinsert, subseq(toquery, start=30, 
end=150),toinsert)
   > pairwiseAlignment(subseq(toquery,start=50,end=120), to.subject, 
type="global-local")
   Global-Local PairwiseAlignedFixedSubject (1 of 1)
   pattern:   [1] 
CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCAG
   subject: [251] 
CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCAG
   score: 140.7047

Note that this result is *very* different from what you get when doing
"global" alignment:

   > pairwiseAlignment(subseq(toquery,start=50,end=120), to.subject)
   Global PairwiseAlignedFixedSubject (1 of 1)
   pattern: [1] 
CTT--------------------------------...TCTATTACCCCCAGGCTCCTCTGACGTGTTTCAG
   subject: [1] 
CTTTGAGCCTAACAGGGGATGGTCCGCCAGTAACG...TCTATTACCCCCAGGCTCCTCTGACGTGTTTCAG
   score: -1119.295

This is because when doing "global" ("global" is short for "global-
global"), pairwiseAlignment() tries to align the pattern to the full
subject, so in your case (short pattern) it needs to "stretch" the
pattern a lot (by introducing gaps in it) to make it "fit" the subject.

Anyway, using "global-local" doesn't solve the problem that, when more
than one match achieves the best possible score, pairwiseAlignment()
only reports one of them (the last one I believe). This is a known
limitation of pairwiseAlignment().

Here are 2 possible workarounds/alternatives:

   1. If you really need all the richness/flexibility of
      the pairwiseAlignment() function (i.e. its ability to support
      all kinds of scoring scheme thru the 'patternQuality',
      'subjectQuality', 'substitutionMatrix', 'fuzzyMatrix',
      'gapOpening' and 'gapExtension' arguments), then
      you can call it again on the truncated subject to find any
      other potential match:

      > pairwiseAlignment(subseq(toquery,start=50,end=120), 
subseq(to.subject, end=250), type="global-local")
      Global-Local PairwiseAlignedFixedSubject (1 of 1)
      pattern:  [1] 
CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCAG
      subject: [50] 
CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCAG
      score: 140.7047

      You'll need to write your own function that uses a loop to truncate
      and search again. At each step you'll want to look at the score of
      the new match you get (the score should only go down) and make sure
      that it's close enough to the score of the original match. You exit
      the loop if it's not.
      Note that this solution doesn't deal properly with the (rare)
      situation where 2 best matches are overlapping. Also, it's
      probably not going to be very efficient.

   2. If you don't need all the richness/flexibility of
      pairwiseAlignment(), and if it's ok for you to allow only a small
      number of indels, then you could always use the matchPattern()
      function:

      > matchPattern(subseq(toquery,start=50,end=120), to.subject, 
max.mismatch=10, with.indels=TRUE)
        Views on a 381-letter DNAString subject
      subject: 
CTTTGAGCCTAACAGGGGATGGTCCGCCAGTAAC...TGTGGCGGTGCCCGGGGCATCTCCGCGGCGGAAC
      views:
          start end width
      [1]    50 120    71 
[CTTGACGCAGATGTTTATACTCCGGACT...CCCCCAGGCTCCTCTGACGTGTTTCAG]
      [2]   251 321    71 
[CTTGACGCAGATGTTTATACTCCGGACT...CCCCCAGGCTCCTCTGACGTGTTTCAG]

      It's less flexible than pairwiseAlignment() but it returns all the
      matches. It's also faster and more memory efficient. See
      ?matchPattern for the details.

Cheers,
H.

On 03/08/2011 10:05 PM, ligang wrote:
> Hello list,
>
> I'm using 'pairwiseAlignment' function of Biostrings package to do sequences
> alignment, and found that when do local alignment, if there were several
> fragments matched between 'pattern' and 'subject' sequence, only one was found
> by 'pairwiseAlignment' function, see the following example:
>
>
> #######################
>
> set.seed(12345)
> DNAString(paste(sample(c("A","G","C","T"),200, rep=TRUE),collapse=""))->toquery
> DNAString('GCGGTGCCCGGGGCATCTCCGCGGCGGAAC')->toinsert
> to.subject<-c(toquery, toinsert, subseq(toquery, start=30, end=150),toinsert)
> pairwiseAlignment(subseq(toquery,start=50,end=120), to.subject)
>
>
> #######################
>
> how could I found all the "matches"?
>
> LiGang
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list