[R] Performing basic Multiple Sequence Alignment in R?

Tue Dec 21 14:06:19 CET 2010

Tal; I'm trimming the BioC posting. In the R lists it is considered  
spamming to cross post. (Please re-read the Posting Guide.)

On Dec 21, 2010, at 4:21 AM, Tal Galili wrote:

> Hello everyone,
>
> I am not sure if this should go on the general R mailing list (for  
> example,
> if there is a text mining solution that might work here) or the  
> bioconductor
> mailing list (since I wasn't able to find a solution to my question on
> searching their lists) - so this time I tried both, and in the  
> future I'll
> know better (in case it should go to only one of the two).
>
>
> The task I'm trying to achieve is to align several sequences together.
> I don't have a basic pattern to match to.  All that I know is that the
> "True" pattern should be of length "30" and that the sequences I'm  
> looking
> at, have had missing values introduced to them at random points.
> Here is an example of such sequences, were on the left we see what  
> is the
> real location of the missing values, and on the right we see the  
> sequence
> that we will be able to observe.  My goal is to reconstruct the left  
> column
> using only the sequences I've got on the right column (based on the  
> fact
> that many of the letters in each position are the same)
>
>                     Real_sequence           The_sequence_we_see
> 1   CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
> 2   CGCAATACTAGC-AGGTGACTTCC-CT-CG   CGCAATACTAGCAGGTGACTTCCCTCG
> 3   CGCAATGATCAC--GGTGGCTCCCGGTGCG  CGCAATGATCACGGTGGCTCCCGGTGCG
> 4   CGCAATACTAACCA-CTAACT--CGCTGCG   CGCAATACTAACCACTAACTCGCTGCG
> 5   CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
> 6   CGCTATACTAACAA-GTG-CTTAGGC-CTG   CGCTATACTAACAAGTGCTTAGGCCTG
> 7   CCCA-C-CTAA-ACGGTGACTTACGCTCCG   CCCACCTAAACGGTGACTTACGCTCCG
>

The agrep function allows one to specify which sort of differences to  
consider in calculating a Levenshtein edit distance. Insertions are  
one possible distance component. You could take a look at its code (in  
C in hte sources) and perhaps rejigger it to spit out the location of  
the deletions.

 > agrep(seqdat$The_sequence_we_see[1], seqdat$Real_sequence,  
max.distance=list(deletions=0, substitutions=0, insertions=0))
integer(0)
 > agrep(seqdat$The_sequence_we_see[1], seqdat$Real_sequence,  
max.distance=list(deletions=0, substitutions=0, insertions=1))
[1] 1

-- 
David.
>
> Here is an example code to reproduce the above example:
>
> ATCG <- c("A","T","C","G")
> set.seed(40)
>
> original.seq <- sample(ATCG, 30, T)
>
> seqS <- matrix(original.seq,200,30, T)
>
> change.letters <- function(x, number.of.changes = 15,
> letters.to.change.with = ATCG)
> {
>
>    number.of.changes <- sample(seq_len(number.of.changes), 1)
>
>    new.letters <- sample(letters.to.change.with , number.of.changes,  
> T)
>
>    where.to.change.the.letters <- sample(seq_along(x) ,  
> number.of.changes, F)
>
>    x[where.to.change.the.letters] <- new.letters
>
>    return(x)
> }
>
> change.letters(original.seq)
>
> insert.missing.values <- function(x) change.letters(x, 3, "-")
>
> insert.missing.values(original.seq)
>
> seqS2 <- t(apply(seqS, 1, change.letters))
>
> seqS3 <- t(apply(seqS2, 1, insert.missing.values))
>
> seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")})
> require(stringr)
> # library(help=stringr)
>
> all.seqS <- str_replace(seqS4,"-" , "")
>
> # how do we allign this?
>
> data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)
>
>
> I understand that if all I had was a string and a pattern I would be  
> able to
> use
>
> library(Biostrings)
>
> pairwiseAlignment(...)
>
>
>
> But in the case I present we are dealing with many sequences to  
> align to one
> another (instead of aligning them to one pattern).
>
> Is there a known method for doing this in R?
>
>
> Thanks,
>
> Tal
>
>
>
> ----------------Contact
> Details:-------------------------------------------------------
> Contact me: Tal.Galili at gmail.com |  972-52-7275845
> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il  
> (Hebrew) |
> www.r-statistics.com (English)
> ----------------------------------------------------------------------------------------------
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT