[BioC] matchTable for pairwiseAlignment in Biostrings?

Valerie Obenchain vobencha at fhcrc.org
Fri Aug 30 17:42:54 CEST 2013


This response from Herve didn't get posted - email troubles while traveling.

Valerie


-------- Original Message --------
Subject: Re: [BioC] matchTable for pairwiseAlignment in Biostrings?
Date: Thu, 22 Aug 2013 23:58:51 -0700
From: Hervé Pagès <hpages at fhcrc.org>
To: Brian Herb <brianherb at jhmi.edu>
CC: Valerie Obenchain <vobencha at fhcrc.org>, bioconductor at r-project.org

Hi Brian,

Global-local alignments can also be described in terms of position
and cigar (POS and CIGAR fields in the SAM/BAM formats).

In your example:
   POS = 5
   CIGAR = 11M1D10M1I7M4I16M  (generated by hand, sorry)

The advantage of the POS + CIGAR representation is that one can make
use of the cigar utilities in GenomicRanges to implement something
equivalent to your matchTable() function:

   library(GenomicRanges)

   matchTable2 <- function(pos, cigar)
   {
     pat_ranges <- cigarRangesAlongQuerySpace(cigar, with.ops=TRUE)[[1]]
     sub_ranges <- cigarRangesAlongReferenceSpace(cigar, pos=pos,
                                                  with.ops=TRUE)[[1]]
     ops <- names(pat_ranges)  # guaranteed to be the same as 
names(sub_ranges)
     I_idx <- which(ops == "I")
     start_idx <- c(1L, I_idx + 1L)
     end_idx <- c(I_idx - 1L, length(pat_ranges))
     PatStart <- start(pat_ranges)[start_idx]
     PatEnd <- end(pat_ranges)[end_idx]
     SubStart <- start(sub_ranges)[start_idx]
     SubEnd <- end(sub_ranges)[end_idx]
     data.frame(PatStart, PatEnd, SubStart, SubEnd)
   }

Then:

   > matchTable2(5, "11M1D10M1I7M4I16M")
     PatStart PatEnd SubStart SubEnd
   1        1     21        5     26
   2       23     29       27     33
   3       34     49       34     49

So it sounds like it would maybe be helpful if we had a cigar() getter
that computes the cigars from a PairwiseAlignments object of type
"global-local". Once we have this, we could even support exporting
this kind of objects to the SAM/BAM format.

Cheers,
H.

On 08/22/2013 02:01 PM, Brian Herb wrote:
> Hi Valerie,
>
> Thank you for looking into this! I actually didn't see my post show up
> so I did a little more tinkering and came up with the following
> function, see what you think:
>
> (input is PairwiseAlignmentsSingleSubject)
>
> matchTable <- function(pwa) {
> p.start <- start(pattern(pwa))
> p.end <- end(pattern(pwa))
> s.start <- start(subject(pwa))
> s.end <- end(subject(pwa))
> inserts <- as.matrix(insertion(pwa)[[1]])
> dels <- as.matrix(deletion(pwa)[[1]])
> s.pos1 <- c(s.start,(s.start+inserts[,1]-1))
> p.pos1 <- c(p.start,(p.start + cumsum(inserts[,2])+cumsum(diff(s.pos1))))
> s.pos2=c(s.pos1[-1]-1,s.end)
> p.pos2=c((head(p.pos1,n=(length(p.pos1)-1))+diff(s.pos1)-1),p.end)
>
> for(i in 1:nrow(dels)){
> tmp=dels[i,]
> p.pos1[which(p.pos1>tmp[1])]=p.pos1[which(p.pos1>tmp[1])]-tmp[2]
> p.pos2[which(p.pos2>tmp[1])]=p.pos2[which(p.pos2>tmp[1])]-tmp[2]
> }
>
> p.pos2[length(p.pos2)]=p.end
>
> tot=cbind(p.pos1,p.pos2,(p.pos2-p.pos1),s.pos1,s.pos2,(s.pos2-s.pos1))
> colnames(tot)=c("PatStart","PatEnd","PatDist","SubStart","SubEnd","SubDist")
> tot
> }
>
> example:
>
>  > tmp1="GTATCGATACGTCGATCGTCGATGCAGCAGCAGACTGATGCATGACTGC"
>
>  > tmp2="CACTCGATCGATGCTCTACATGATCGTGCAGCAACTGATGCATGACTGC"
>
>  > test=pairwiseAlignment(pattern=tmp1,subject=tmp2,type="global-local")
>
>  > matchTable(test)
>       PatStart PatEnd PatDist SubStart SubEnd SubDist
> [1,]        1     21      20        5     26      21
> [2,]       23     29       6       27     33       6
> [3,]       34     49      15       34     49      15
> For my work I am aligning longer sequences (~1kb) and I care more about
> where chunks of these sequences align rather than exactly which bases
> match within these chunks. Since I'm comparing sequences from different
> species, I'm trying to find where sequence from one species matches to
> the other, allowing for insertions and deletions. The function could
> probably be written better, but that is the hazard of letting biologists
> learn R. Thanks again for your help.
>
> writePairwiseAlignment output is attached
>
>
>
>
> On Thu, Aug 22, 2013 at 4:38 PM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
>     Hi Brian,
>
>     I don't know of a function that creates such a table but I think you
>     can create this with the start() and end() accessors.
>
>     pat <- DNAStringSet(c("CACCAGCT", "TTCGCC", "CGGTC"))
>     sub <- DNAString("__ACTTCACCAGCTGACCTCG")
>     res <- pairwiseAlignment(pat, sub)
>
>      > res
>     Global PairwiseAlignmentsSingleSubjec__t (1 of 3)
>     pattern: [1] CACCAGCT
>     subject: [5] CACCAGCT
>     score: -48.14596
>
>     DataFrame(subjectStart=start(__subject(res)),
>                subjectEnd=end(subject(res)),
>                patternStart=start(pattern(__res)),
>                patternEnd=end(pattern(res)))
>
>     DataFrame with 3 rows and 4 columns
>        subjectStart subjectEnd patternStart patternEnd
>           <integer>  <integer>    <integer>  <integer>
>     1            5         12            1          8
>     2            3          8            1          6
>     3            1          5            1          5
>
>
>     Valerie
>
>
>     On 08/21/2013 08:40 AM, Brian Herb wrote:
>
>         Hi all,
>
>         I feel like I'm missing some obvious function, but is there a
>         simple way to
>         output a table of matching start and end positions between the
>         subject and
>         pattern in a pairwiseAlignment result (say from a global-local
>         alignment)?
>         For example:
>
>             s.pos1 s.pos2 p.pos1 p.pos2
>            [1,]    354    359      1      6
>            [2,]    360    364      9     19
>
>         where s.pos1 and s.pos2 are the start and end of a section of
>         the subject
>         that corresponds to the start and end (p.pos1 and p.pos2)  of
>         the pattern,
>         taking into account insertions and deletions. I know the
>         insertion and
>         deletion information is easily available, but I haven't found a
>         function in
>         the literature that that summarizes them into a mummer like
>         output of
>         corresponding start and end positions.
>
>
>
>
>
>
>
> --
> Brian Herb, Ph.D.
> Johns Hopkins School of Medicine
> Dr. Andrew Feinberg Laboratory
> Rangos 580
> 855 N. Wolfe St.
> Baltimore, MD 21205
> Phone:410-614-3478
> Fax: 410-614-9819



More information about the Bioconductor mailing list