[BioC] Biostring: print sequence alignment to file

Hervé Pagès hpages at fhcrc.org
Thu Jul 19 08:10:57 CEST 2012


Hi Martin,

On 06/22/2012 12:09 PM, Hervé Pagès wrote:
> Hi Martin,
>
> On 06/14/2012 06:55 AM, Martin Preusse wrote:
>> Hi guys,
>>
>> anything new on the sequence output? Maybe I missed something :)
>> please tell me if you need testing etc.
>
> Still on my list. Will work on this in the next couple of weeks. I'll
> let you know. Thanks for the reminder.

There is now a writePairwiseAlignments() function (in Biostrings 2.25.8)
for doing this. It produces a file in the "pair" format (as described on
the EMBOSS website, at the URL you sent earlier):

   > library(Biostrings)
   > pattern <- DNAString("CGTACGTAACGTTCGT")
   > subject <- DNAString("CGTCGTCGTCCGTAA")
   > x1 <- pairwiseAlignment(pattern, subject)

   > x1
   Global PairwiseAlignmentsSingleSubject (1 of 1)
   pattern: [1] CGTACGTAACGTTCGT
   subject: [1] CGT-CGT--CGTCCGT
   score: -32.11822

   > writePairwiseAlignments(x1, block.width=10)  # 'block.width' 
default is 50
   ########################################
   # Program: Biostrings (version 2.25.8), a Bioconductor package
   # Rundate: Wed Jul 18 23:05:46 2012
   ########################################
   #=======================================
   #
   # Aligned_sequences: 2
   # 1: P1
   # 2: S1
   # Matrix: NA
   # Gap_penalty: 14.0
   # Extend_penalty: 4.0
   #
   # Length: 18
   # Identity:      12/18 (66.7%)
   # Similarity:    NA/18 (NA%)
   # Gaps:           5/18 (27.8%)
   # Score: -32.11822
   #
   #
   #=======================================

   P1                 1 CGTACGTAAC     10
                        ||| |||  |
   S1                 1 CGT-CGT--C      7

   P1                11 GTTCGT--     16
                        || |||
   S1                 8 GTCCGTAA     15


   #---------------------------------------
   #---------------------------------------


Only lightly tested. Not necessarily very performant (no C code). Please
have a look at the man page for some caveats (especially if you plan to
use it on NON global alignments). Feedback welcome.

Thanks,
H.

>
> H.
>
>>
>> Cheers
>> Martin
>>
>>
>> Am Samstag, 21. April 2012 um 11:55 schrieb Martin Preusse:
>>
>>> Hi Hervé,
>>>
>>> thanks for your help! If you need suggestions, help or testing, just
>>> say the word.
>>>
>>> Will you implement the header also? If you do so, I would be thankful
>>> for an option like "header=F" for the output.
>>>
>>>
>>> Cheers
>>> Martin
>>>
>>>
>>> Am Samstag, 21. April 2012 um 02:12 schrieb Hervé Pagès:
>>>
>>>> Thanks Martin and Thomas for the useful feedback. The 'pair' and
>>>> 'markx0' formats supported by Emboss seem indeed appropriate for
>>>> printing the output of pairwiseAlignment() to a file. I'll add
>>>> support for those 2 formats in Biostrings. Won't be before 1 week
>>>> or 2 though...
>>>>
>>>> Cheers,
>>>> H.
>>>>
>>>> On 04/18/2012 03:20 AM, Martin Preusse wrote:
>>>>> Hi,
>>>>>
>>>>> I just found this function to print a pairwise alignments in
>>>>> blocks. Doesn't add the match/mismatch indicators between
>>>>> sequences, but might be a starting point:
>>>>>
>>>>> http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html#viewing-a-long-pairwise-alignment
>>>>>
>>>>>
>>>>>
>>>>> Cheers
>>>>> Martin
>>>>>
>>>>>
>>>>>
>>>>> Am Mittwoch, 18. April 2012 um 12:16 schrieb Martin Preusse:
>>>>>
>>>>>> Hi everybody,
>>>>>>
>>>>>> I think the output format depends on the purpose of the alignment.
>>>>>>
>>>>>> A pairwise sequence alignment is usually done to compare two
>>>>>> sequences base by base. In my case, I compare sequencing results
>>>>>> of cloned expression constructs with the desired sequence. Thus,
>>>>>> the best output format would be "BLAST like".
>>>>>>
>>>>>> seq1: 1 ATCTGC 7
>>>>>> | | | . . |
>>>>>> seq2: 1 ATCAAC 7
>>>>>>
>>>>>> When doing MSA, most people might rather be interested in the
>>>>>> consensus sequence. E.g. in the context of conservation between
>>>>>> species.
>>>>>>
>>>>>> So write.PairwiseAlignedXStringSet() and write.MultipleAlignment()
>>>>>> are quite different and BLAST doesn't make much sense for multiple
>>>>>> alignments. This means it would be best to put the output in the
>>>>>> PairwiseAlignment/MultipleAlignment and not to the XStringSet, right?
>>>>>>
>>>>>> This is an overview of sequence alignment formats used by EMBOSS:
>>>>>> http://emboss.sourceforge.net/docs/themes/AlignFormats.html
>>>>>>
>>>>>> 'pair' or 'markx0' would be perfectly fine.
>>>>>>
>>>>>>
>>>>>> Cheers
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>>
>>>>>> Am Dienstag, 17. April 2012 um 22:13 schrieb Thomas Girke:
>>>>>>
>>>>>>> Hi Hervé,
>>>>>>>
>>>>>>> To me, the most basic and versatile MSA or pairwise alignment
>>>>>>> format to output
>>>>>>> to would be FASTA since it is compatible with almost any other
>>>>>>> alignment
>>>>>>> editing software. For text-based viewing purposes my preference
>>>>>>> would be
>>>>>>> to also output to a format similar to the one shown in the following
>>>>>>> example. When there are only two sequences then one could show
>>>>>>> instead
>>>>>>> of a consensus line the pipe characters between the two sequences to
>>>>>>> indicate identical residues which mimics the blast output. A more
>>>>>>> standardized version of this pairwise alignment format can be found
>>>>>>> here:
>>>>>>> http://emboss.sourceforge.net/apps/cvs/emboss/apps/needle.html
>>>>>>>
>>>>>>> library(Biostrings)
>>>>>>> p450<-
>>>>>>> read.AAStringSet("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/p450.mul",
>>>>>>> "fasta")
>>>>>>>
>>>>>>> StringSet2html<- function(msa=p450, file="p450.html", start=1,
>>>>>>> end=length(p450[[1]]), counter=20, browser=TRUE, ...) {
>>>>>>> if(class(msa)=="AAStringSet") msa<- AAStringSet(msa, start=start,
>>>>>>> end=end)
>>>>>>> if(class(msa)=="DNAStringSet") msa<- DNAStringSet(msa,
>>>>>>> start=start, end=end)
>>>>>>> msavec<- sapply(msa, toString)
>>>>>>> offset<- (counter-1)-nchar(nchar(msavec[1]))
>>>>>>> legend<- paste(paste(paste(paste(rep(" ", offset), collapse=""),
>>>>>>> format(seq(0,
>>>>>>> nchar(msavec[1]), by=counter)[-1])), collapse=""), collapse="")
>>>>>>> consensus<- consensusString(msavec, ambiguityMap=".", ...)
>>>>>>> msavec<- paste(msavec, rowSums(as.matrix(msa) != "-"), sep=" ")
>>>>>>> msavec<- paste(format(c("", names(msa), "Consensus"),
>>>>>>> justify="left"), c(legend, msavec,
>>>>>>> consensus), sep=" ")
>>>>>>> msavec<- c("<html><pre>", msavec,"</pre></html>")
>>>>>>> writeLines(msavec, file)
>>>>>>> if(browser==TRUE) { browseURL(file) }
>>>>>>> }
>>>>>>> StringSet2html(msa=p450, file="p450.html", start=1,
>>>>>>> end=length(p450[[1]]), counter=20, browser=T, threshold=1.0)
>>>>>>> StringSet2html(msa=p450, file="p450.html", start=450, end=470,
>>>>>>> counter=20, browser=T, threshold=1.0)
>>>>>>>
>>>>>>>
>>>>>>> Thomas
>>>>>>>
>>>>>>> On Tue, Apr 17, 2012 at 07:43:30PM +0000, Hervé Pagès wrote:
>>>>>>>> Hi Thomas,
>>>>>>>>
>>>>>>>> On 04/17/2012 11:49 AM, Thomas Girke wrote:
>>>>>>>>> What about providing an option in pairwiseAlignment to output
>>>>>>>>> to the
>>>>>>>>> MultipleAlignment class in Biostrings and then write the latter to
>>>>>>>>> different alignment formats?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Or we could provide coercion methods to switch between
>>>>>>>> PairwiseAlignedXStringSet and MultipleAlignment.
>>>>>>>>
>>>>>>>> Anyway that kind of moves Martin's problem from having a
>>>>>>>> write.PairwiseAlignedXStringSet() function that produces BLAST
>>>>>>>> output
>>>>>>>> to having a write.MultipleAlignment() function that produces BLAST
>>>>>>>> output. For the specific case of BLAST output, would it make sense
>>>>>>>> to support it for MultipleAlignment? Can someone point me to an
>>>>>>>> example
>>>>>>>> of such output? Or even better, to the specs of such format?
>>>>>>>>
>>>>>>>> Note that right now there is the write.phylip() function in
>>>>>>>> Biostrings
>>>>>>>> for writing a MultipleAlignment object to a file but the Phylip
>>>>>>>> format
>>>>>>>> looks very different from the BLAST output:
>>>>>>>>
>>>>>>>> hpages at latitude:~$ head -n 20 phylip_test.txt
>>>>>>>> 9 2343
>>>>>>>> Mask 0000000000 0000000000 0000000000 0000000000 0000000000
>>>>>>>> Human -----TCCCG TCTCCGCAGC AAAAAAGTTT GAGTCGCCGC TGCCGGGTTG
>>>>>>>> Chimp ---------- ---------- ---------- ---------- ----------
>>>>>>>> Cow ---------- ---------- ---------- ---------- ----------
>>>>>>>> Mouse ---------- ---------- --AAAAGTTG GAGTCTTCGC TTGAGAGTTG
>>>>>>>> Rat ---------- ---------- ---------- ---------- ----------
>>>>>>>> Dog ---------- ---------- ---------- ---------- ----------
>>>>>>>> Chicken ---------- ----CGGCTC CGCAGCGCCT CACTCGCGCA GTCCCCGCGC
>>>>>>>> Salmon GGGGGAGACT TCAGAAGTTG TTGTCCTCTC CGCTGATAAC AGTTGAGATG
>>>>>>>>
>>>>>>>> 0000000000 0000000000 0000000000 0001111111 1111111111
>>>>>>>> CCAGCGGAGT CGCGCGTCGG GAGCTACGTA GGGCAGAGAA GTCA-TGGCT
>>>>>>>> ---------- ---------- ---------- ---------- ---A-TGGCT
>>>>>>>> ---------- ---------- ---------- ---GAGAGAA GTCA-TGGCT
>>>>>>>> CCAGCGGAGT CGCGCGCCGA CAGCTACGCG GCGCAGA-AA GTCA-TGGCT
>>>>>>>> ---------- ---------- ---------- ---------- ---A-TGGCT
>>>>>>>> ---------- ---------- ---------- ---------- ---A-TGGCT
>>>>>>>> AGGGCCGGGC AGAGGCGCAC GCAGCTCCCC GGGCGGCCCC GCTC-CAGCC
>>>>>>>> CGCATATTAT TATTACCTTT AGGACAAGTT GAATGTGTTC GTCAACATCT
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> H.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thomas
>>>>>>>>>
>>>>>>>>> On Tue, Apr 17, 2012 at 05:59:24PM +0000, Hervé Pagès wrote:
>>>>>>>>>> Hi Martin,
>>>>>>>>>>
>>>>>>>>>> On 04/16/2012 04:06 AM, Martin Preusse wrote:
>>>>>>>>>>> Hi Charles,
>>>>>>>>>>>
>>>>>>>>>>> thanks! Your solution allows to print the two alignment
>>>>>>>>>>> strings separately.
>>>>>>>>>>>
>>>>>>>>>>> I was thinking of an output as generated by alignment tools:
>>>>>>>>>>>
>>>>>>>>>>> AGT-TCTAT
>>>>>>>>>>> | | | | | | | | |
>>>>>>>>>>> AGTATCTAT
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This looks like BLAST output. Is this what you have in mind?
>>>>>>>>>> Note that
>>>>>>>>>> there are many alignment tools and many ways to output the
>>>>>>>>>> result to a
>>>>>>>>>> file. I'm not really familiar with the BLAST output format. Is it
>>>>>>>>>> specified somewhere? Would that make sense to add something
>>>>>>>>>> like a
>>>>>>>>>> write.PairwiseAlignedXStringSet() function to Biostrings for
>>>>>>>>>> writing
>>>>>>>>>> the result of pairwiseAlignment() to a file? We could do this and
>>>>>>>>>> support the BLAST format if that's a commonly used format.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> H.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> For this I would have to write a function to output the
>>>>>>>>>>> strings in blocks of e.g. 60 nucleotides, right?
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>> Martin
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Am Freitag, 13. April 2012 um 19:21 schrieb Chu, Charles:
>>>>>>>>>>>
>>>>>>>>>>>> write.XStringSet
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>>> Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>>> Search the archives:
>>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Hervé Pagès
>>>>>>>>>>
>>>>>>>>>> Program in Computational Biology
>>>>>>>>>> Division of Public Health Sciences
>>>>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>>>>> P.O. Box 19024
>>>>>>>>>> Seattle, WA 98109-1024
>>>>>>>>>>
>>>>>>>>>> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
>>>>>>>>>> Phone: (206) 667-5791
>>>>>>>>>> Fax: (206) 667-1319
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>> Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>> Search the archives:
>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Hervé Pagès
>>>>>>>>
>>>>>>>> Program in Computational Biology
>>>>>>>> Division of Public Health Sciences
>>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>>> P.O. Box 19024
>>>>>>>> Seattle, WA 98109-1024
>>>>>>>>
>>>>>>>> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
>>>>>>>> Phone: (206) 667-5791
>>>>>>>> Fax: (206) 667-1319
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
>>>> Phone: (206) 667-5791
>>>> Fax: (206) 667-1319
>>>
>>
>>
>>
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list