[BioC] List all alignments in Biostrings

Hervé Pagès hpages at fhcrc.org
Tue Mar 25 21:06:14 CET 2014


On 03/25/2014 11:24 AM, Alpesh Querer wrote:
> Thank you Hervé  . so if I want to determine to determine number of
> local alignments between two strings no less than length 4, is that
> possible in biostrings?

By "local alignments no less than length 4" do you mean "common
substrings of length >= 4"?

> The alternative is create sliding window fragments  from one sequence
> and use matchPattern ?
>    subject1: TTAAACGTAAAATGAATAAGCCT
>    subject:2 GGAAACGTGAATTGAATAAGCG

Yes you could try doing something like that. Create the views (aka
sliding window) on the first sequence:

   s1 <- DNAString("TTAAACGTAAAATGAATAAGCCT")
   s1views <- Views(s1, IRanges(seq_len(length(s1)-3), width=4))

Then use matchPDict() instead of matchPattern(). This will be faster
than calling matchPattern() on each view:

   m <- matchPDict(PDict(s1views), s2)
   start_in_s1 <- rep(start(s1views), elementLengths(m))
   start_in_s2 <- unlist(start(m))
   seq <- as.character(s1views[start_in_s1])
   hits0 <- data.frame(start_in_s1, start_in_s2, width=4L, seq)

Then:

   > hits0
      start_in_s1 start_in_s2 width  seq
   1            3           3     4 AAAC
   2            4           4     4 AACG
   3            5           5     4 ACGT
   4           13           8     4 TGAA
   5           13          13     4 TGAA
   6           14           9     4 GAAT
   7           14          14     4 GAAT
   8           15          15     4 AATA
   9           16          16     4 ATAA
   10          17          17     4 TAAG
   11          18          18     4 AAGC

Hits that are at consecutive positions in both 'start_in_s1' and
'start_in_s2' should probably be merged in a single longer hit
(they clearly correspond to the same hit). This can be done with
something like:

   mergeHits <- function(hits)
   {
     m <- IRanges:::matchIntegerPairs(hits$start_in_s1 - 1L,
                                      hits$start_in_s2 - 1L,
                                      hits$start_in_s1,
                                      hits$start_in_s2)
     merge_idx <- which(!is.na(m))
     if (length(merge_idx) == 0L)
       return(hits)
     ## This for loop can be very slow if there are a lot of hits
     ## to merge. There must be a better way!
     for (i in merge_idx) {
       mi <- m[i]
       mj <- m[mi]
       if (!is.na(mj))
         m[i] <- mj
     }
     hits$width <- hits$width + tabulate(m, nbins=length(m))
     hits[-merge_idx, c("start_in_s1", "start_in_s2", "width")]
   }

   hits <- mergeHits(hits0)

Then:

   > hits
     start_in_s1 start_in_s2 width
   1           3           3     6
   4          13           8     5
   5          13          13     9

Putting back the hit sequences (they can be extracted either from
s1 or s2, it doesn't matter, that should give us exactly the same
sequences):

   hits$seq <- as.character(Views(s1, IRanges(hits$start_in_s1, 
width=hits$width)))

Then:

   > hits
     start_in_s1 start_in_s2 width       seq
   1           3           3     6    AAACGT
   4          13           8     5     TGAAT
   5          13          13     9 TGAATAAGC

HTH,
H.

>
>
> On Tue, Mar 25, 2014 at 1:06 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi Al,
>
>
>     On 03/25/2014 09:31 AM, Alpesh Querer wrote:
>
>         Hello,
>
>         I understand the pairwiseAlignment() returns the (first) best
>         match only,
>         is there a way for it to return all alignments for 2 sequences
>         both local
>         and global?
>
>
>     Short answer is no.
>
>     Long answer: I'm not sure what you'd like to get in case of a global
>     alignment ("global" is short for "global-global").
>     If you're doing "global-local" alignment, note that pairwiseAlignment()
>     returns the *last* best match, not the first:
>
>        pattern <- DNAString("AAA")
>        subject <- DNAString("TTAAACGTAAAATGAAT")
>
>     Then:
>
>        > pairwiseAlignment(pattern, subject, type="global-local")
>        Global-Local PairwiseAlignmentsSingleSubjec__t (1 of 1)
>        pattern:  [1] AAA
>        subject: [10] AAA
>        score: 5.945268
>
>     You can use matchPattern() to get all the matches that satisfy
>     some criteria (but this criteria must be expressed in terms of
>     edit distance):
>
>        > matchPattern(pattern, subject, max.mismatch=1, with.indels=TRUE)
>          Views on a 17-letter DNAString subject
>        subject: TTAAACGTAAAATGAAT
>        views:
>            start end width
>        [1]     3   5     3 [AAA]
>        [2]     9  11     3 [AAA]
>        [3]    10  12     3 [AAA]
>        [4]    15  16     2 [AA]
>
>     This doesn't give you a score though nor does it guarantee that you'll
>     get a match. However, if the returned Views object is not empty, then
>     it's guaranteed to contain all the "best" matches. Even though the Views
>     object can be post-processed to keep the best matches only, there is no
>     way to ask matchPattern() something like "give me the best matches
>     only".
>
>     HTH,
>     H.
>
>
>         Thanks,
>         Al
>
>                  [[alternative HTML version deleted]]
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>         Search the archives:
>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list