[BioC] Biostring: print sequence alignment to file

Martin Preusse martin.preusse at googlemail.com
Wed Apr 18 12:16:57 CEST 2012


Hi everybody,  

I think the output format depends on the purpose of the alignment.  

A pairwise sequence alignment is usually done to compare two sequences base by base. In my case, I compare sequencing results of cloned expression constructs with the desired sequence. Thus, the best output format would be "BLAST like".

seq1: 1 ATCTGC 7
             | | | . . |
seq2: 1 ATCAAC 7

When doing MSA, most people might rather be interested in the consensus sequence. E.g. in the context of conservation between species.

So write.PairwiseAlignedXStringSet() and write.MultipleAlignment() are quite different and BLAST doesn't make much sense for multiple alignments. This means it would be best to put the output in the PairwiseAlignment/MultipleAlignment and not to the XStringSet, right?

This is an overview of sequence alignment formats used by EMBOSS:
http://emboss.sourceforge.net/docs/themes/AlignFormats.html

'pair' or 'markx0' would be perfectly fine.


Cheers
Martin



Am Dienstag, 17. April 2012 um 22:13 schrieb Thomas Girke:

> Hi Hervé,
>  
> To me, the most basic and versatile MSA or pairwise alignment format to output
> to would be FASTA since it is compatible with almost any other alignment
> editing software. For text-based viewing purposes my preference would be
> to also output to a format similar to the one shown in the following
> example. When there are only two sequences then one could show instead
> of a consensus line the pipe characters between the two sequences to
> indicate identical residues which mimics the blast output. A more
> standardized version of this pairwise alignment format can be found
> here:
> http://emboss.sourceforge.net/apps/cvs/emboss/apps/needle.html  
>  
> library(Biostrings)
> p450 <- read.AAStringSet("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/p450.mul", "fasta")  
>  
> StringSet2html <- function(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=TRUE, ...) {
> if(class(msa)=="AAStringSet") msa <- AAStringSet(msa, start=start, end=end)
> if(class(msa)=="DNAStringSet") msa <- DNAStringSet(msa, start=start, end=end)
> msavec <- sapply(msa, toString)
> offset <- (counter-1)-nchar(nchar(msavec[1]))
> legend <- paste(paste(paste(paste(rep(" ", offset), collapse=""), format(seq(0,  
> nchar(msavec[1]), by=counter)[-1])), collapse=""), collapse="")
> consensus <- consensusString(msavec, ambiguityMap=".", ...)
> msavec <- paste(msavec, rowSums(as.matrix(msa) != "-"), sep=" ")
> msavec <- paste(format(c("", names(msa), "Consensus"), justify="left"), c(legend, msavec,  
> consensus), sep=" ")
> msavec <- c("<html><pre>", msavec, "</pre></html>")
> writeLines(msavec, file)
> if(browser==TRUE) { browseURL(file) }
> }
> StringSet2html(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=T, threshold=1.0)  
> StringSet2html(msa=p450, file="p450.html", start=450, end=470, counter=20, browser=T, threshold=1.0)  
>  
>  
> Thomas
>  
> On Tue, Apr 17, 2012 at 07:43:30PM +0000, Hervé Pagès wrote:
> > Hi Thomas,
> >  
> > On 04/17/2012 11:49 AM, Thomas Girke wrote:
> > > What about providing an option in pairwiseAlignment to output to the
> > > MultipleAlignment class in Biostrings and then write the latter to
> > > different alignment formats?
> >  
> >  
> >  
> >  
> > Or we could provide coercion methods to switch between
> > PairwiseAlignedXStringSet and MultipleAlignment.
> >  
> > Anyway that kind of moves Martin's problem from having a
> > write.PairwiseAlignedXStringSet() function that produces BLAST output
> > to having a write.MultipleAlignment() function that produces BLAST
> > output. For the specific case of BLAST output, would it make sense
> > to support it for MultipleAlignment? Can someone point me to an example
> > of such output? Or even better, to the specs of such format?
> >  
> > Note that right now there is the write.phylip() function in Biostrings
> > for writing a MultipleAlignment object to a file but the Phylip format
> > looks very different from the BLAST output:
> >  
> > hpages at latitude:~$ head -n 20 phylip_test.txt
> > 9 2343
> > Mask 0000000000 0000000000 0000000000 0000000000 0000000000
> > Human -----TCCCG TCTCCGCAGC AAAAAAGTTT GAGTCGCCGC TGCCGGGTTG
> > Chimp ---------- ---------- ---------- ---------- ----------
> > Cow ---------- ---------- ---------- ---------- ----------
> > Mouse ---------- ---------- --AAAAGTTG GAGTCTTCGC TTGAGAGTTG
> > Rat ---------- ---------- ---------- ---------- ----------
> > Dog ---------- ---------- ---------- ---------- ----------
> > Chicken ---------- ----CGGCTC CGCAGCGCCT CACTCGCGCA GTCCCCGCGC
> > Salmon GGGGGAGACT TCAGAAGTTG TTGTCCTCTC CGCTGATAAC AGTTGAGATG
> >  
> > 0000000000 0000000000 0000000000 0001111111 1111111111
> > CCAGCGGAGT CGCGCGTCGG GAGCTACGTA GGGCAGAGAA GTCA-TGGCT
> > ---------- ---------- ---------- ---------- ---A-TGGCT
> > ---------- ---------- ---------- ---GAGAGAA GTCA-TGGCT
> > CCAGCGGAGT CGCGCGCCGA CAGCTACGCG GCGCAGA-AA GTCA-TGGCT
> > ---------- ---------- ---------- ---------- ---A-TGGCT
> > ---------- ---------- ---------- ---------- ---A-TGGCT
> > AGGGCCGGGC AGAGGCGCAC GCAGCTCCCC GGGCGGCCCC GCTC-CAGCC
> > CGCATATTAT TATTACCTTT AGGACAAGTT GAATGTGTTC GTCAACATCT
> >  
> > Thanks!
> > H.
> >  
> > >  
> > > Thomas
> > >  
> > > On Tue, Apr 17, 2012 at 05:59:24PM +0000, Hervé Pagès wrote:
> > > > Hi Martin,
> > > >  
> > > > On 04/16/2012 04:06 AM, Martin Preusse wrote:
> > > > > Hi Charles,
> > > > >  
> > > > > thanks! Your solution allows to print the two alignment strings separately.
> > > > >  
> > > > > I was thinking of an output as generated by alignment tools:
> > > > >  
> > > > > AGT-TCTAT
> > > > > | | | | | | | | |
> > > > > AGTATCTAT
> > > >  
> > > >  
> > > >  
> > > >  
> > > > This looks like BLAST output. Is this what you have in mind? Note that
> > > > there are many alignment tools and many ways to output the result to a
> > > > file. I'm not really familiar with the BLAST output format. Is it
> > > > specified somewhere? Would that make sense to add something like a
> > > > write.PairwiseAlignedXStringSet() function to Biostrings for writing
> > > > the result of pairwiseAlignment() to a file? We could do this and
> > > > support the BLAST format if that's a commonly used format.
> > > >  
> > > > Thanks,
> > > > H.
> > > >  
> > > > >  
> > > > > For this I would have to write a function to output the strings in blocks of e.g. 60 nucleotides, right?
> > > > >  
> > > > > Cheers
> > > > > Martin
> > > > >  
> > > > >  
> > > > >  
> > > > > Am Freitag, 13. April 2012 um 19:21 schrieb Chu, Charles:
> > > > >  
> > > > > > write.XStringSet
> > > > >  
> > > > > _______________________________________________
> > > > > Bioconductor mailing list
> > > > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
> > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > >  
> > > >  
> > > >  
> > > >  
> > > >  
> > > > --
> > > > Hervé Pagès
> > > >  
> > > > Program in Computational Biology
> > > > Division of Public Health Sciences
> > > > Fred Hutchinson Cancer Research Center
> > > > 1100 Fairview Ave. N, M1-B514
> > > > P.O. Box 19024
> > > > Seattle, WA 98109-1024
> > > >  
> > > > E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
> > > > Phone: (206) 667-5791
> > > > Fax: (206) 667-1319
> > > >  
> > > > _______________________________________________
> > > > Bioconductor mailing list
> > > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
> > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> > >  
> >  
> >  
> >  
> >  
> >  
> > --  
> > Hervé Pagès
> >  
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >  
> > E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
> > Phone: (206) 667-5791
> > Fax: (206) 667-1319
>  



More information about the Bioconductor mailing list