[BioC] Parse sequences so they align properly

Martin Morgan mtmorgan at fhcrc.org
Sun Jan 5 00:14:40 CET 2014


On 01/04/2014 02:05 PM, Joshua Banta wrote:
> Dear Hervé,
>
> Thank you very much for your response.
>
> Your script has a line of code that reads:
>
>   if (gsub("\\[.\\]", "", seq) != ref.sequence)
>    stop("'sequences[", i, "]' is not compatible with 'ref.sequence'")
>
> I take it this means that your functions do not allow for polymorphisms and other variations. For instance:
>
> ref.sequence <- "GCAAATT"
> sequences <- read.table(textConnection("name sequence
> sequence1 GC-AATT
> sequence2 GCAAAT[T]T
> sequence3 GCAAAGT"), header = TRUE, as.is<http://as.is> = TRUE)
> closeAllConnections()
>
> Sequence1 has a deletion (relative to the reference sequence) at position 3, sequence 2 has an insertion (relative to the reference sequence) between positions 5 and 6, and sequence3 has a G instead of a T (relative to the reference sequence) at position 6. (I am numbering the positions from left to right, using the reference sequence to assign the position numbers.)
>
> What I want to get is this:
>
> reference.sequence: "GCAAAT-T
> sequence1: GC-AAT-T
> sequence2: GCAAATTT
> sequence3: GCAAAG-T

I took a different approach, not super efficient. Here's the original data, 
modified so that [GCC] is replaced by [G][C][C], with a single vector of 
'sequences', and the ref.sequence included as the first (I don't think it's 
handled specially, other than not having any insertions).

     ref.sequence <- "ATAGCCGCA"
     sequences <- c(ref.sequence,
                    "AT[G][C][C]AGCCG[T]CA",
                    "ATAGCCGC[C][A][C]A",
                    "AT[G][C][C]AGCCGCA")

I create two copies of the sequences, one with the gap characters replaced by 
'+' and another with the raw sequence (no [] marks)

     gapped <- gsub("\\[.\\]", "+", sequences)
     raw <- gsub("(\\[|\\])", "", sequences)

I'm going to look at each letter in each 'gapped' sequence and decide whether 
it's a gap or not, and build up 'result' strings that are made of either an 
induced gap or the actual character at that position. Here's the index into the 
character positions of the gapped string, the result strings that I'm going to 
build up, and a stopping criterion

     idx <- rep(1, length(gapped))
     result <- character(length(gapped))
     maxit <- nchar(gapped)

Now I'll look at each character of each string. When there are insertions, I'll 
either add '-' or the inserted ('raw') character to the result string, and 
increment the index into the string(s) with the insertion.

     repeat {
         if (all(idx > maxit))
             break
         ins <- substr(gapped, idx, idx) == "+"
         char <- substr(raw, idx, idx)
         if (any(ins)) {
             char <- ifelse(ins, char, "-")
             idx <- ifelse(ins, idx + 1, idx)
         } else
             idx <- idx + 1
         result <- sprintf("%s%s", result, char)
     }

and then print the result out

     cat(paste(result, collapse="\n"), "\n")
     ## AT---AGCCG-C---A
     ## ATGCCAGCCGTC---A
     ## AT---AGCCG-CCACA
     ## ATGCCAGCCG-C---A

or for your second example

     cat(paste(result, collapse="\n"), "\n")
     ## GCAAAT-T
     ## GC-AAT-T
     ## GCAAATTT
     ## GCAAAG-T

>
> When I apply your functions to this new data, I get the following errors:
>
>> stretchRefSequence(ref.sequence, as.character(sequences[,2]))
> Error in computeInsSizesInRef(ref.sequence, sequences) :
>    'sequences[1]' is not compatible with 'ref.sequence'
>
>> stretchSequences(ref.sequence, as.character(sequences[,2]))
> Error in computeInsSizesInRef(ref.sequence, sequences) :
>    'sequences[1]' is not compatible with 'ref.sequence'
>
> I tried deleting the following lines of code from your script:
>
>   if (gsub("\\[.\\]", "", seq) != ref.sequence)
>    stop("'sequences[", i, "]' is not compatible with 'ref.sequence'")
>
> Reapplying the functions, I get:
>
>> stretchRefSequence(ref.sequence, as.character(sequences[,2]))
> [1] "GC-AAA-TT"
>> #not correct
>
>> stretchSequences(ref.sequence, as.character(sequences[,2]))
> [1] "GC-AAT-T"  "GC-AAATTT" "GC-AAA-GT"
> Warning messages:
> 1: In .Method(..., na.rm = na.rm) :
>    an argument will be fractionally recycled
> 2: In ins_sizes_in_ref - ins_sizes :
>    longer object length is not a multiple of shorter object length
> 3: In seq_along(ins_sizes_in_ref) + cumsum(ins_sizes) :
>    longer object length is not a multiple of shorter object length
>> #also not correct
>
> Is there a way to modify your codes to make my new toy data give me the answers I'm looking for? Again, I really appreciate your help, because this parsing syntax is way over my head at the moment.
>
> -----------------------------------
> Josh Banta, Ph.D
> Assistant Professor
> Department of Biology
> The University of Texas at Tyler
> Tyler, TX 75799
> Tel: (903) 565-5655
> http://plantevolutionaryecology.org
>
>
> 	[[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list