[BioC] Performing basic Multiple Sequence Alignment in R?

Thomas Girke thomas.girke at ucr.edu
Tue Dec 21 17:39:07 CET 2010


To computer the multiple alignments, you need to use external programs
like dialign, ClustalW, T-Coffee, MUSCLE or Multalin. If you are familiar with
the command-line then you can execute them from R with the system() function,
and then import the resulting alignment files into R as XMultipleAlignment objects.
For details see: http://bioconductor.org/packages/release/bioc/html/Biostrings.html. 

For handling multiple alignements in R, you also may want take a look at some of these examples
http://manuals.bioinformatics.ucr.edu/home/ht-seq#TOC-Multiple-Sequence-Alignments-MSAs-
or the ape library at CRAN. 

Thomas

On Tue, Dec 21, 2010 at 11:29:05AM +0200, Tal Galili wrote:
> Hello everyone,
> 
> I am not sure if this should go on the general R mailing list (for example,
> if there is a text mining solution that might work here) or the bioconductor
> mailing list (since I wasn't able to find a solution to my question on
> searching their lists) - so this time I tried both, and in the future I'll
> know better (in case it should go to only one of the two).
> 
> 
> The task I'm trying to achieve is to align several sequences together.
> I don't have a basic pattern to match to.  All that I know is that the
> "True" pattern should be of length "30" and that the sequences I'm looking
> at, have had missing values introduced to them at random points.
> Here is an example of such sequences, were on the left we see what is the
> real location of the missing values, and on the right we see the sequence
> that we will be able to observe.  My goal is to reconstruct the left column
> using only the sequences I've got on the right column (based on the fact
> that many of the letters in each position are the same)
> 
>                      Real_sequence           The_sequence_we_see
> 
> 1   CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
> 
> 2   CGCAATACTAGC-AGGTGACTTCC-CT-CG   CGCAATACTAGCAGGTGACTTCCCTCG
> 
> 3   CGCAATGATCAC--GGTGGCTCCCGGTGCG  CGCAATGATCACGGTGGCTCCCGGTGCG
> 
> 4   CGCAATACTAACCA-CTAACT--CGCTGCG   CGCAATACTAACCACTAACTCGCTGCG
> 
> 5   CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
> 
> 6   CGCTATACTAACAA-GTG-CTTAGGC-CTG   CGCTATACTAACAAGTGCTTAGGCCTG
> 
> 7   CCCA-C-CTAA-ACGGTGACTTACGCTCCG   CCCACCTAAACGGTGACTTACGCTCCG
> 
> 
> Here is an example code to reproduce the above example:
> 
> ATCG <- c("A","T","C","G")
> 
> set.seed(40)
> 
> 
> original.seq <- sample(ATCG, 30, T)
> 
> 
> seqS <- matrix(original.seq,200,30, T)
> 
> 
> change.letters <- function(x, number.of.changes = 15,
> letters.to.change.with = ATCG)
> 
> {
> 
> 
>     number.of.changes <- sample(seq_len(number.of.changes), 1)
> 
> 
>     new.letters <- sample(letters.to.change.with , number.of.changes, T)
> 
> 
>     where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F)
> 
> 
>     x[where.to.change.the.letters] <- new.letters
> 
> 
>     return(x)
> 
> }
> 
> 
> change.letters(original.seq)
> 
> 
> insert.missing.values <- function(x) change.letters(x, 3, "-")
> 
> 
> insert.missing.values(original.seq)
> 
> 
> seqS2 <- t(apply(seqS, 1, change.letters))
> 
> 
> seqS3 <- t(apply(seqS2, 1, insert.missing.values))
> 
> 
> seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")})
> 
> require(stringr)
> 
> # library(help=stringr)
> 
> 
> all.seqS <- str_replace(seqS4,"-" , "")
> 
> 
> # how do we allign this?
> 
> 
> data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)
> 
> 
> I understand that if all I had was a string and a pattern I would be able to
> use
> 
> library(Biostrings)
> 
> 
> pairwiseAlignment(...)
> 
> 
> 
> But in the case I present we are dealing with many sequences to align to one
> another (instead of aligning them to one pattern).
> 
> Is there a known method for doing this in R?
> 
> 
> Thanks,
> 
> Tal
> 
> 
> 
> ----------------Contact
> Details:-------------------------------------------------------
> Contact me: Tal.Galili at gmail.com |  972-52-7275845
> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
> www.r-statistics.com (English)
> ----------------------------------------------------------------------------------------------
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list