[BioC] Biostring: print sequence alignment to file

Hervé Pagès hpages at fhcrc.org
Sat Apr 21 02:12:47 CEST 2012


Thanks Martin and Thomas for the useful feedback. The 'pair' and
'markx0' formats supported by Emboss seem indeed appropriate for
printing the output of pairwiseAlignment() to a file. I'll add
support for those 2 formats in Biostrings. Won't be before 1 week
or 2 though...

Cheers,
H.

On 04/18/2012 03:20 AM, Martin Preusse wrote:
> Hi,
>
> I just found this function to print a pairwise alignments in blocks. Doesn't add the match/mismatch indicators between sequences, but might be a starting point:
>
> http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html#viewing-a-long-pairwise-alignment
>
>
> Cheers
> Martin
>
>
>
> Am Mittwoch, 18. April 2012 um 12:16 schrieb Martin Preusse:
>
>> Hi everybody,
>>
>> I think the output format depends on the purpose of the alignment.
>>
>> A pairwise sequence alignment is usually done to compare two sequences base by base. In my case, I compare sequencing results of cloned expression constructs with the desired sequence. Thus, the best output format would be "BLAST like".
>>
>> seq1: 1 ATCTGC 7
>> | | | . . |
>> seq2: 1 ATCAAC 7
>>
>> When doing MSA, most people might rather be interested in the consensus sequence. E.g. in the context of conservation between species.
>>
>> So write.PairwiseAlignedXStringSet() and write.MultipleAlignment() are quite different and BLAST doesn't make much sense for multiple alignments. This means it would be best to put the output in the PairwiseAlignment/MultipleAlignment and not to the XStringSet, right?
>>
>> This is an overview of sequence alignment formats used by EMBOSS:
>> http://emboss.sourceforge.net/docs/themes/AlignFormats.html
>>
>> 'pair' or 'markx0' would be perfectly fine.
>>
>>
>> Cheers
>> Martin
>>
>>
>>
>> Am Dienstag, 17. April 2012 um 22:13 schrieb Thomas Girke:
>>
>>> Hi Hervé,
>>>
>>> To me, the most basic and versatile MSA or pairwise alignment format to output
>>> to would be FASTA since it is compatible with almost any other alignment
>>> editing software. For text-based viewing purposes my preference would be
>>> to also output to a format similar to the one shown in the following
>>> example. When there are only two sequences then one could show instead
>>> of a consensus line the pipe characters between the two sequences to
>>> indicate identical residues which mimics the blast output. A more
>>> standardized version of this pairwise alignment format can be found
>>> here:
>>> http://emboss.sourceforge.net/apps/cvs/emboss/apps/needle.html
>>>
>>> library(Biostrings)
>>> p450<- read.AAStringSet("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/p450.mul", "fasta")
>>>
>>> StringSet2html<- function(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=TRUE, ...) {
>>> if(class(msa)=="AAStringSet") msa<- AAStringSet(msa, start=start, end=end)
>>> if(class(msa)=="DNAStringSet") msa<- DNAStringSet(msa, start=start, end=end)
>>> msavec<- sapply(msa, toString)
>>> offset<- (counter-1)-nchar(nchar(msavec[1]))
>>> legend<- paste(paste(paste(paste(rep(" ", offset), collapse=""), format(seq(0,
>>> nchar(msavec[1]), by=counter)[-1])), collapse=""), collapse="")
>>> consensus<- consensusString(msavec, ambiguityMap=".", ...)
>>> msavec<- paste(msavec, rowSums(as.matrix(msa) != "-"), sep=" ")
>>> msavec<- paste(format(c("", names(msa), "Consensus"), justify="left"), c(legend, msavec,
>>> consensus), sep=" ")
>>> msavec<- c("<html><pre>", msavec,"</pre></html>")
>>> writeLines(msavec, file)
>>> if(browser==TRUE) { browseURL(file) }
>>> }
>>> StringSet2html(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=T, threshold=1.0)
>>> StringSet2html(msa=p450, file="p450.html", start=450, end=470, counter=20, browser=T, threshold=1.0)
>>>
>>>
>>> Thomas
>>>
>>> On Tue, Apr 17, 2012 at 07:43:30PM +0000, Hervé Pagès wrote:
>>>> Hi Thomas,
>>>>
>>>> On 04/17/2012 11:49 AM, Thomas Girke wrote:
>>>>> What about providing an option in pairwiseAlignment to output to the
>>>>> MultipleAlignment class in Biostrings and then write the latter to
>>>>> different alignment formats?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Or we could provide coercion methods to switch between
>>>> PairwiseAlignedXStringSet and MultipleAlignment.
>>>>
>>>> Anyway that kind of moves Martin's problem from having a
>>>> write.PairwiseAlignedXStringSet() function that produces BLAST output
>>>> to having a write.MultipleAlignment() function that produces BLAST
>>>> output. For the specific case of BLAST output, would it make sense
>>>> to support it for MultipleAlignment? Can someone point me to an example
>>>> of such output? Or even better, to the specs of such format?
>>>>
>>>> Note that right now there is the write.phylip() function in Biostrings
>>>> for writing a MultipleAlignment object to a file but the Phylip format
>>>> looks very different from the BLAST output:
>>>>
>>>> hpages at latitude:~$ head -n 20 phylip_test.txt
>>>> 9 2343
>>>> Mask 0000000000 0000000000 0000000000 0000000000 0000000000
>>>> Human -----TCCCG TCTCCGCAGC AAAAAAGTTT GAGTCGCCGC TGCCGGGTTG
>>>> Chimp ---------- ---------- ---------- ---------- ----------
>>>> Cow ---------- ---------- ---------- ---------- ----------
>>>> Mouse ---------- ---------- --AAAAGTTG GAGTCTTCGC TTGAGAGTTG
>>>> Rat ---------- ---------- ---------- ---------- ----------
>>>> Dog ---------- ---------- ---------- ---------- ----------
>>>> Chicken ---------- ----CGGCTC CGCAGCGCCT CACTCGCGCA GTCCCCGCGC
>>>> Salmon GGGGGAGACT TCAGAAGTTG TTGTCCTCTC CGCTGATAAC AGTTGAGATG
>>>>
>>>> 0000000000 0000000000 0000000000 0001111111 1111111111
>>>> CCAGCGGAGT CGCGCGTCGG GAGCTACGTA GGGCAGAGAA GTCA-TGGCT
>>>> ---------- ---------- ---------- ---------- ---A-TGGCT
>>>> ---------- ---------- ---------- ---GAGAGAA GTCA-TGGCT
>>>> CCAGCGGAGT CGCGCGCCGA CAGCTACGCG GCGCAGA-AA GTCA-TGGCT
>>>> ---------- ---------- ---------- ---------- ---A-TGGCT
>>>> ---------- ---------- ---------- ---------- ---A-TGGCT
>>>> AGGGCCGGGC AGAGGCGCAC GCAGCTCCCC GGGCGGCCCC GCTC-CAGCC
>>>> CGCATATTAT TATTACCTTT AGGACAAGTT GAATGTGTTC GTCAACATCT
>>>>
>>>> Thanks!
>>>> H.
>>>>
>>>>>
>>>>> Thomas
>>>>>
>>>>> On Tue, Apr 17, 2012 at 05:59:24PM +0000, Hervé Pagès wrote:
>>>>>> Hi Martin,
>>>>>>
>>>>>> On 04/16/2012 04:06 AM, Martin Preusse wrote:
>>>>>>> Hi Charles,
>>>>>>>
>>>>>>> thanks! Your solution allows to print the two alignment strings separately.
>>>>>>>
>>>>>>> I was thinking of an output as generated by alignment tools:
>>>>>>>
>>>>>>> AGT-TCTAT
>>>>>>> | | | | | | | | |
>>>>>>> AGTATCTAT
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> This looks like BLAST output. Is this what you have in mind? Note that
>>>>>> there are many alignment tools and many ways to output the result to a
>>>>>> file. I'm not really familiar with the BLAST output format. Is it
>>>>>> specified somewhere? Would that make sense to add something like a
>>>>>> write.PairwiseAlignedXStringSet() function to Biostrings for writing
>>>>>> the result of pairwiseAlignment() to a file? We could do this and
>>>>>> support the BLAST format if that's a commonly used format.
>>>>>>
>>>>>> Thanks,
>>>>>> H.
>>>>>>
>>>>>>>
>>>>>>> For this I would have to write a function to output the strings in blocks of e.g. 60 nucleotides, right?
>>>>>>>
>>>>>>> Cheers
>>>>>>> Martin
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Am Freitag, 13. April 2012 um 19:21 schrieb Chu, Charles:
>>>>>>>
>>>>>>>> write.XStringSet
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Hervé Pagès
>>>>>>
>>>>>> Program in Computational Biology
>>>>>> Division of Public Health Sciences
>>>>>> Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>> P.O. Box 19024
>>>>>> Seattle, WA 98109-1024
>>>>>>
>>>>>> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
>>>>>> Phone: (206) 667-5791
>>>>>> Fax: (206) 667-1319
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
>>>> Phone: (206) 667-5791
>>>> Fax: (206) 667-1319
>>>
>>
>
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list