[BioC] pairwiseAlignment generates different outcomes for the same input sequences

Hervé Pagès hpages at fhcrc.org
Mon Mar 14 22:57:37 CET 2011


Hi Alex,

Yes, I find it surprising too that you don't get the same thing.

With this simpler example:

   > pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local")
   Global-Local PairwiseAlignedFixedSubject (1 of 1)
   pattern: [1] AA-AA
   subject: [4] AAGAA
   score: 17.92695

   > pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"), 
type="global-local")
   Global-Local PairwiseAlignedFixedSubject (1 of 1)
   pattern: [1] AAAA
   subject: [4] AAGA
   score: 0.0459826

Note that I would *definititely* expect to get the same thing when
a 'substitutionMatrix' is provided. And this seems to be actually the
case (as long as all the letters in 'pattern' and 'subject' are
represented in the substitution matrix):

   > mat <- nucleotideSubstitutionMatrix(match=1, mismatch=-3, 
baseOnly=TRUE)
   > mat
      A  C  G  T
   A  1 -3 -3 -3
   C -3  1 -3 -3
   G -3 -3  1 -3
   T -3 -3 -3  1

   > pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local", 
substitutionMatrix=mat)
   Global-Local PairwiseAlignedFixedSubject (1 of 1)
   pattern: [1] AAAA
   subject: [5] AGAA
   score: 0

   > pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"), 
type="global-local", substitutionMatrix=mat)
   Global-Local PairwiseAlignedFixedSubject (1 of 1)
   pattern: [1] AAAA
   subject: [5] AGAA
   score: 0

However, by default (i.e. when no 'substitutionMatrix' is provided),
pairwiseAlignment() performs *quality-based* alignments. According
to the man page, this means that, internally, a substitution matrix
is generated (and used) based on the qualities of the letters in the
pattern and the subject (a constant default quality is assigned to
all the letters). It's unclear to me why the code generating this
substitution matrix would return different matrices depending on
whether 'pattern' and 'subject' are ordinary character vectors or
DNAString objects. It seems that the matrix generated for DNAString
objects makes it harder to see gaps in the pattern. Anyway, I need
to have a closer look at this before I can comment more.

Cheers,
H.


On 03/13/2011 06:58 PM, Alogmail2 at aol.com wrote:
> Dear Mailing List,
>
> Why does pairwiseAlignment() generate different outcomes for the same input
>   sequences defined differently in terms of classes (see showMethods below):
>
> for
>
> pattern="character", subject="character"
>
> vs.
>
> pattern="DNAString", subject="DNAString"
>
> ?
>
> It generates the same outcomes for the case of
>
>
> pattern="character", subject="character"
>
> vs.
>
> pattern="character", subject="DNAString"
>
>
> It looks like a bug.
>
> Thanks
>
> Alex
>
>
>
>
>
>
> #showMethods("pairwiseAlignment")
> #Function: pairwiseAlignment (package Biostrings)
> #pattern="character",  subject="character"
> #pattern="character",  subject="DNAString"
> #    (inherited from: pattern="character",  subject="XString")
> #pattern="DNAString", subject="DNAString"
>
>> pattern.1
> 50-letter "DNAString" instance
> seq:  CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
>> pattern.2
> [1]  "CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC"
>> subject.1
> 543-letter "DNAString" instance
> seq:
> AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTCAGATT...CAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGCCAGTAGTTACG
> CGTCGT
>>   subject.2
> [1]
> "AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTCAGATTCGCCCTTCGTCAGCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCGGCCTACGCTGCTTCCTATTAC
> AGCAGCGCGATGCAGGCCTACAATAGTCAATCGACGTCGGCCTACATGCCAAGCAGTGGATTCTATAATGGCGCAT
> CTTCGCAGACGCCCTACGGAGTCCTGGCGCCCTCCACTTACACAACGATGGGCGTTCCCAGTACAAGAGGTTTAGG
> CCAACAATGTAAAAATGGACAATCATTAGCACAAACGCCTCCGTATTTGAGCTCGTACGGGTCGGCATTCGGTGGT
> GTCACAGCCAGCAGTTCGCCTTCGGGTCCACCCGCCTACGCGTCCGCTTATGGATCGGCATACAATAGCGCCACCG
> CCGCCCAATCGTTCACCAACAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGC
> CAGTAGTTACGCGTCGT"
>>
>
>>
> pairwiseAlignment(pattern=pattern.1,subject=subject.1,type="global-local")
> Global-Local  PairwiseAlignedFixedSubject (1 of 1)
> pattern:   [1]  CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
> subject: [429]  TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC
> score: -91.50367
>>
> pairwiseAlignment(pattern=pattern.2,subject=subject.2,type="global-local")
> Global-Local  PairwiseAlignedFixedSubject (1 of 1)
> pattern:  [1]  CTGC--CATGGCAAAGCTC--GCTGCC-TCAGAGGCCGCCAC-AATGGTTGCGCAC
> subject: [80]  CTTCGTCA--GCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCG--GCCTAC
> score: 95.69296
>>
> pairwiseAlignment(pattern=pattern.2,subject=subject.1,type="global-local")
> Global-Local  PairwiseAlignedFixedSubject (1 of 1)
> pattern:   [1]  CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
> subject: [429]  TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC
> score: -91.50367
>
>> sessionInfo()
> R version 2.12.2 (2011-02-25)
> Platform:  i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252   LC_CTYPE=English_United
> States.1252    LC_MONETARY=English_United  States.1252
> [4]  LC_NUMERIC=C                            LC_TIME=English_United
> States.1252
>
> attached base packages:
> [1] stats     graphics   grDevices utils     datasets  methods    base
>
> other attached packages:
> [1] altcdfenvs_2.12.0 hypergraph_1.22.0  graph_1.28.0
> makecdfenv_1.28.0  affy_1.28.0       Biobase_2.10.0     GeneR_2.20.0      seqinr_3.0-1
> [9] Biostrings_2.18.4 IRanges_1.8.9      limma_3.6.9
>
> loaded via a namespace (and not attached):
> [1]  affyio_1.18.0          preprocessCore_1.12.0  tools_2.12.2
>
>
>
>
> ###############################
>
> Nutritional Sciences and Toxicology,
> 119 Morgan Hall
> UC.Berkeley,CA 94720
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list