[BioC] homopolymer gap extension penalty in Biostrings

Hervé Pagès hpages at fhcrc.org
Wed Mar 7 01:56:52 CET 2012


Hi Noah,

On 03/02/2012 12:16 PM, Noah Hoffman wrote:
> Hello Herve,
>
> It has been a while... I hope you're doing well. I have a question about the implementation of needleman-wunsch in pairwiseAlignment. We have some ion torrent sequencing data. Like the 454, (except worse, apparently) it likes to insert runs of homopolymers. The alignments with the primers, etc would be improved considerably if we could specify a lower penalty for a gap extension when there is a homopolymeric insertion in the opposite strand (as opposed to an insertion that is not a homopolymer). That is, two different gap extension penalties. Is this possible in pairwiseAlignment?

I guess that would be technically possible even though certainly not 
straightforward. The Needleman-Wunsch / Smith-Waterman algos are
complicated algorithms and the current implementation in Biostrings
is from Patrick Aboyoun. Unfortunately I'm not familiar enough with
it to be able to tell exactly how hard it would be to add something
like the "homopolymeric insertion penalty" feature you propose but
my gut feeling is that it would probably require some significant
amount of work.

Just to clarify, why are you saying "in the opposite strand" when you
say:

   specify a lower penalty for a gap extension when there is a
   homopolymeric insertion in the opposite strand

Could you provide an example?

One workaround you could explore is to use a 2-pass algo (UNTESTED,
and I'm not sure it would do exactly what you want):

   1st pass: Use pairwiseAlignment() with a gap penalty as low as the
   penalty you'd like to put on homopolymeric insertions. You said
   you were using Needleman-Wunsch so I'm assuming that you are doing
   "global-global" alignment so you probably don't need to work around
   one limitation of pairwiseAlignment() when doing "global-local" which
   is that it only returns the first best match.
   Let's assume you managed to obtain 1 alignment for each read with a
   score.

   2nd pass: Look at each alignment and in particular at the insertions
   they contain. Adjust the score of the alignment based on the type
   of insertions it has (i.e. homopolymeric or not). Patrick implemented
   tools for dissecting the alignments returned by pairwiseAlignment():
   they are described in the man page for PairwiseAlignedXStringSet
   objects (?PairwiseAlignedXStringSet).

>
> Another thing that would be useful would be per-strand gap creation and gap extension penalties.

Can you please elaborate on this? Depending on what you are trying to do
exactly, there might be an easy way to go or not. Please provide an
example.

>
> Any ideas? Fred Ross (who you may not have met yet) is working on this and would like to hear your answer as well.

I'm cc'ing the Bioconductor mailing list so other people can give some
advice. Maybe someone knows a tool in another BioC package or out of
Bioconductor that already provide those features?

Cheers,
H.

>
> Thanks a lot,
> Noah


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list