[BioC] Biostring: print sequence alignment to file

Hervé Pagès hpages at fhcrc.org
Thu Jul 19 19:23:04 CEST 2012


Hi Martin,

On 07/19/2012 04:30 AM, Martin Preusse wrote:
> Hi Hervé,
>
> thanks! This looks great. I have a question though: How can I install the latest version of Biostrings?
>
> When I install with:
>
> source("http://bioconductor.org/biocLite.R")
> biocLite("Biostrings")
>
>
>
> I get Version 2.22.0 … ;)

Biostrings 2.25 is part of the current devel version of BioC
(BioC 2.11). To install BioC devel, see this link:

   http://bioconductor.org/developers/useDevel/

Note that today Biostrings 2.25.7 is the latest version available thru
biocLite():

   http://bioconductor.org/packages/2.11/bioc/html/Biostrings.html

If everything goes well, 2.25.8 will become available tomorrow morning
around 10 am (Seattle time). Please let me know if you run into any
issue with this.

Thanks!
H.

>
>
> Cheers
> Martin
>
>
>
> Am Donnerstag, 19. Juli 2012 um 08:10 schrieb Hervé Pagès:
>
>> Hi Martin,
>>
>> On 06/22/2012 12:09 PM, Hervé Pagès wrote:
>>> Hi Martin,
>>>
>>> On 06/14/2012 06:55 AM, Martin Preusse wrote:
>>>> Hi guys,
>>>>
>>>> anything new on the sequence output? Maybe I missed something :)
>>>> please tell me if you need testing etc.
>>>
>>>
>>>
>>> Still on my list. Will work on this in the next couple of weeks. I'll
>>> let you know. Thanks for the reminder.
>>
>>
>>
>> There is now a writePairwiseAlignments() function (in Biostrings 2.25.8)
>> for doing this. It produces a file in the "pair" format (as described on
>> the EMBOSS website, at the URL you sent earlier):
>>
>>> library(Biostrings)
>>> pattern <- DNAString("CGTACGTAACGTTCGT")
>>> subject <- DNAString("CGTCGTCGTCCGTAA")
>>> x1 <- pairwiseAlignment(pattern, subject)
>>
>>
>>> x1
>> Global PairwiseAlignmentsSingleSubject (1 of 1)
>> pattern: [1] CGTACGTAACGTTCGT
>> subject: [1] CGT-CGT--CGTCCGT
>> score: -32.11822
>>
>>> writePairwiseAlignments(x1, block.width=10) # 'block.width'
>> default is 50
>> ########################################
>> # Program: Biostrings (version 2.25.8), a Bioconductor package
>> # Rundate: Wed Jul 18 23:05:46 2012
>> ########################################
>> #=======================================
>> #
>> # Aligned_sequences: 2
>> # 1: P1
>> # 2: S1
>> # Matrix: NA
>> # Gap_penalty: 14.0
>> # Extend_penalty: 4.0
>> #
>> # Length: 18
>> # Identity: 12/18 (66.7%)
>> # Similarity: NA/18 (NA%)
>> # Gaps: 5/18 (27.8%)
>> # Score: -32.11822
>> #
>> #
>> #=======================================
>>
>> P1 1 CGTACGTAAC 10
>> ||| ||| |
>> S1 1 CGT-CGT--C 7
>>
>> P1 11 GTTCGT-- 16
>> || |||
>> S1 8 GTCCGTAA 15
>>
>>
>> #---------------------------------------
>> #---------------------------------------
>>
>>
>> Only lightly tested. Not necessarily very performant (no C code). Please
>> have a look at the man page for some caveats (especially if you plan to
>> use it on NON global alignments). Feedback welcome.
>>
>> Thanks,
>> H.
>>
>>>
>>> H.
>>>
>>>>
>>>> Cheers
>>>> Martin
>>>>
>>>>
>>>> Am Samstag, 21. April 2012 um 11:55 schrieb Martin Preusse:
>>>>
>>>>> Hi Hervé,
>>>>>
>>>>> thanks for your help! If you need suggestions, help or testing, just
>>>>> say the word.
>>>>>
>>>>> Will you implement the header also? If you do so, I would be thankful
>>>>> for an option like "header=F" for the output.
>>>>>
>>>>>
>>>>> Cheers
>>>>> Martin
>>>>>
>>>>>
>>>>> Am Samstag, 21. April 2012 um 02:12 schrieb Hervé Pagès:
>>>>>
>>>>>> Thanks Martin and Thomas for the useful feedback. The 'pair' and
>>>>>> 'markx0' formats supported by Emboss seem indeed appropriate for
>>>>>> printing the output of pairwiseAlignment() to a file. I'll add
>>>>>> support for those 2 formats in Biostrings. Won't be before 1 week
>>>>>> or 2 though...
>>>>>>
>>>>>> Cheers,
>>>>>> H.
>>>>>>
>>>>>> On 04/18/2012 03:20 AM, Martin Preusse wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I just found this function to print a pairwise alignments in
>>>>>>> blocks. Doesn't add the match/mismatch indicators between
>>>>>>> sequences, but might be a starting point:
>>>>>>>
>>>>>>> http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html#viewing-a-long-pairwise-alignment
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Cheers
>>>>>>> Martin
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Am Mittwoch, 18. April 2012 um 12:16 schrieb Martin Preusse:
>>>>>>>
>>>>>>>> Hi everybody,
>>>>>>>>
>>>>>>>> I think the output format depends on the purpose of the alignment.
>>>>>>>>
>>>>>>>> A pairwise sequence alignment is usually done to compare two
>>>>>>>> sequences base by base. In my case, I compare sequencing results
>>>>>>>> of cloned expression constructs with the desired sequence. Thus,
>>>>>>>> the best output format would be "BLAST like".
>>>>>>>>
>>>>>>>> seq1: 1 ATCTGC 7
>>>>>>>> | | | . . |
>>>>>>>> seq2: 1 ATCAAC 7
>>>>>>>>
>>>>>>>> When doing MSA, most people might rather be interested in the
>>>>>>>> consensus sequence. E.g. in the context of conservation between
>>>>>>>> species.
>>>>>>>>
>>>>>>>> So write.PairwiseAlignedXStringSet() and write.MultipleAlignment()
>>>>>>>> are quite different and BLAST doesn't make much sense for multiple
>>>>>>>> alignments. This means it would be best to put the output in the
>>>>>>>> PairwiseAlignment/MultipleAlignment and not to the XStringSet, right?
>>>>>>>>
>>>>>>>> This is an overview of sequence alignment formats used by EMBOSS:
>>>>>>>> http://emboss.sourceforge.net/docs/themes/AlignFormats.html
>>>>>>>>
>>>>>>>> 'pair' or 'markx0' would be perfectly fine.
>>>>>>>>
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Martin
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Am Dienstag, 17. April 2012 um 22:13 schrieb Thomas Girke:
>>>>>>>>
>>>>>>>>> Hi Hervé,
>>>>>>>>>
>>>>>>>>> To me, the most basic and versatile MSA or pairwise alignment
>>>>>>>>> format to output
>>>>>>>>> to would be FASTA since it is compatible with almost any other
>>>>>>>>> alignment
>>>>>>>>> editing software. For text-based viewing purposes my preference
>>>>>>>>> would be
>>>>>>>>> to also output to a format similar to the one shown in the following
>>>>>>>>> example. When there are only two sequences then one could show
>>>>>>>>> instead
>>>>>>>>> of a consensus line the pipe characters between the two sequences to
>>>>>>>>> indicate identical residues which mimics the blast output. A more
>>>>>>>>> standardized version of this pairwise alignment format can be found
>>>>>>>>> here:
>>>>>>>>> http://emboss.sourceforge.net/apps/cvs/emboss/apps/needle.html
>>>>>>>>>
>>>>>>>>> library(Biostrings)
>>>>>>>>> p450<-
>>>>>>>>> read.AAStringSet("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/p450.mul",
>>>>>>>>> "fasta")
>>>>>>>>>
>>>>>>>>> StringSet2html<- function(msa=p450, file="p450.html", start=1,
>>>>>>>>> end=length(p450[[1]]), counter=20, browser=TRUE, ...) {
>>>>>>>>> if(class(msa)=="AAStringSet") msa<- AAStringSet(msa, start=start,
>>>>>>>>> end=end)
>>>>>>>>> if(class(msa)=="DNAStringSet") msa<- DNAStringSet(msa,
>>>>>>>>> start=start, end=end)
>>>>>>>>> msavec<- sapply(msa, toString)
>>>>>>>>> offset<- (counter-1)-nchar(nchar(msavec[1]))
>>>>>>>>> legend<- paste(paste(paste(paste(rep(" ", offset), collapse=""),
>>>>>>>>> format(seq(0,
>>>>>>>>> nchar(msavec[1]), by=counter)[-1])), collapse=""), collapse="")
>>>>>>>>> consensus<- consensusString(msavec, ambiguityMap=".", ...)
>>>>>>>>> msavec<- paste(msavec, rowSums(as.matrix(msa) != "-"), sep=" ")
>>>>>>>>> msavec<- paste(format(c("", names(msa), "Consensus"),
>>>>>>>>> justify="left"), c(legend, msavec,
>>>>>>>>> consensus), sep=" ")
>>>>>>>>> msavec<- c("<html><pre>", msavec,"</pre></html>")
>>>>>>>>> writeLines(msavec, file)
>>>>>>>>> if(browser==TRUE) { browseURL(file) }
>>>>>>>>> }
>>>>>>>>> StringSet2html(msa=p450, file="p450.html", start=1,
>>>>>>>>> end=length(p450[[1]]), counter=20, browser=T, threshold=1.0)
>>>>>>>>> StringSet2html(msa=p450, file="p450.html", start=450, end=470,
>>>>>>>>> counter=20, browser=T, threshold=1.0)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thomas
>>>>>>>>>
>>>>>>>>> On Tue, Apr 17, 2012 at 07:43:30PM +0000, Hervé Pagès wrote:
>>>>>>>>>> Hi Thomas,
>>>>>>>>>>
>>>>>>>>>> On 04/17/2012 11:49 AM, Thomas Girke wrote:
>>>>>>>>>>> What about providing an option in pairwiseAlignment to output
>>>>>>>>>>> to the
>>>>>>>>>>> MultipleAlignment class in Biostrings and then write the latter to
>>>>>>>>>>> different alignment formats?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Or we could provide coercion methods to switch between
>>>>>>>>>> PairwiseAlignedXStringSet and MultipleAlignment.
>>>>>>>>>>
>>>>>>>>>> Anyway that kind of moves Martin's problem from having a
>>>>>>>>>> write.PairwiseAlignedXStringSet() function that produces BLAST
>>>>>>>>>> output
>>>>>>>>>> to having a write.MultipleAlignment() function that produces BLAST
>>>>>>>>>> output. For the specific case of BLAST output, would it make sense
>>>>>>>>>> to support it for MultipleAlignment? Can someone point me to an
>>>>>>>>>> example
>>>>>>>>>> of such output? Or even better, to the specs of such format?
>>>>>>>>>>
>>>>>>>>>> Note that right now there is the write.phylip() function in
>>>>>>>>>> Biostrings
>>>>>>>>>> for writing a MultipleAlignment object to a file but the Phylip
>>>>>>>>>> format
>>>>>>>>>> looks very different from the BLAST output:
>>>>>>>>>>
>>>>>>>>>> hpages at latitude:~$ head -n 20 phylip_test.txt
>>>>>>>>>> 9 2343
>>>>>>>>>> Mask 0000000000 0000000000 0000000000 0000000000 0000000000
>>>>>>>>>> Human -----TCCCG TCTCCGCAGC AAAAAAGTTT GAGTCGCCGC TGCCGGGTTG
>>>>>>>>>> Chimp ---------- ---------- ---------- ---------- ----------
>>>>>>>>>> Cow ---------- ---------- ---------- ---------- ----------
>>>>>>>>>> Mouse ---------- ---------- --AAAAGTTG GAGTCTTCGC TTGAGAGTTG
>>>>>>>>>> Rat ---------- ---------- ---------- ---------- ----------
>>>>>>>>>> Dog ---------- ---------- ---------- ---------- ----------
>>>>>>>>>> Chicken ---------- ----CGGCTC CGCAGCGCCT CACTCGCGCA GTCCCCGCGC
>>>>>>>>>> Salmon GGGGGAGACT TCAGAAGTTG TTGTCCTCTC CGCTGATAAC AGTTGAGATG
>>>>>>>>>>
>>>>>>>>>> 0000000000 0000000000 0000000000 0001111111 1111111111
>>>>>>>>>> CCAGCGGAGT CGCGCGTCGG GAGCTACGTA GGGCAGAGAA GTCA-TGGCT
>>>>>>>>>> ---------- ---------- ---------- ---------- ---A-TGGCT
>>>>>>>>>> ---------- ---------- ---------- ---GAGAGAA GTCA-TGGCT
>>>>>>>>>> CCAGCGGAGT CGCGCGCCGA CAGCTACGCG GCGCAGA-AA GTCA-TGGCT
>>>>>>>>>> ---------- ---------- ---------- ---------- ---A-TGGCT
>>>>>>>>>> ---------- ---------- ---------- ---------- ---A-TGGCT
>>>>>>>>>> AGGGCCGGGC AGAGGCGCAC GCAGCTCCCC GGGCGGCCCC GCTC-CAGCC
>>>>>>>>>> CGCATATTAT TATTACCTTT AGGACAAGTT GAATGTGTTC GTCAACATCT
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> H.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thomas
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 17, 2012 at 05:59:24PM +0000, Hervé Pagès wrote:
>>>>>>>>>>>> Hi Martin,
>>>>>>>>>>>>
>>>>>>>>>>>> On 04/16/2012 04:06 AM, Martin Preusse wrote:
>>>>>>>>>>>>> Hi Charles,
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks! Your solution allows to print the two alignment
>>>>>>>>>>>>> strings separately.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was thinking of an output as generated by alignment tools:
>>>>>>>>>>>>>
>>>>>>>>>>>>> AGT-TCTAT
>>>>>>>>>>>>> | | | | | | | | |
>>>>>>>>>>>>> AGTATCTAT
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> This looks like BLAST output. Is this what you have in mind?
>>>>>>>>>>>> Note that
>>>>>>>>>>>> there are many alignment tools and many ways to output the
>>>>>>>>>>>> result to a
>>>>>>>>>>>> file. I'm not really familiar with the BLAST output format. Is it
>>>>>>>>>>>> specified somewhere? Would that make sense to add something
>>>>>>>>>>>> like a
>>>>>>>>>>>> write.PairwiseAlignedXStringSet() function to Biostrings for
>>>>>>>>>>>> writing
>>>>>>>>>>>> the result of pairwiseAlignment() to a file? We could do this and
>>>>>>>>>>>> support the BLAST format if that's a commonly used format.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> H.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> For this I would have to write a function to output the
>>>>>>>>>>>>> strings in blocks of e.g. 60 nucleotides, right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>> Martin
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am Freitag, 13. April 2012 um 19:21 schrieb Chu, Charles:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> write.XStringSet
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>>>>> Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
>>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>>>>> Search the archives:
>>>>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Hervé Pagès
>>>>>>>>>>>>
>>>>>>>>>>>> Program in Computational Biology
>>>>>>>>>>>> Division of Public Health Sciences
>>>>>>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>>>>>>> P.O. Box 19024
>>>>>>>>>>>> Seattle, WA 98109-1024
>>>>>>>>>>>>
>>>>>>>>>>>> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
>>>>>>>>>>>> Phone: (206) 667-5791
>>>>>>>>>>>> Fax: (206) 667-1319
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>>>> Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>>>> Search the archives:
>>>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Hervé Pagès
>>>>>>>>>>
>>>>>>>>>> Program in Computational Biology
>>>>>>>>>> Division of Public Health Sciences
>>>>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>>>>> P.O. Box 19024
>>>>>>>>>> Seattle, WA 98109-1024
>>>>>>>>>>
>>>>>>>>>> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
>>>>>>>>>> Phone: (206) 667-5791
>>>>>>>>>> Fax: (206) 667-1319
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Hervé Pagès
>>>>>>
>>>>>> Program in Computational Biology
>>>>>> Division of Public Health Sciences
>>>>>> Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>> P.O. Box 19024
>>>>>> Seattle, WA 98109-1024
>>>>>>
>>>>>> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
>>>>>> Phone: (206) 667-5791
>>>>>> Fax: (206) 667-1319
>>>>>
>>>>
>>>
>>
>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list