[BioC] The ambiguous alignment examples in findMateAlignment don't have to be dumped

Fri Jan 24 01:25:17 CET 2014

Hi Yun,

On 01/23/2014 12:56 PM, yun YAN wrote:
> Hi Hervé Pagès and other maintainers in Bioconductor,
>
> Thanks for your package instruction
> (http://140.107.3.20/packages/2.14/bioc/manuals/GenomicAlignments/man/GenomicAlignments.pdf)
> so that I could understand the how does it work. Here you mentioned the
> dumped alignments but I was wondering whether they should be dumped?
>
>     seqnames strand cigar qwidth start end
>     <Rle> <Rle> <character> <integer> <integer> <integer>
>     SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd
>     mate/read2 ...
>     SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd
>     mate/read2 ...
>     SRR031714.2658602 chr2R - 13M372N24M 37 6983858 6984266 ## 1st
>     mate/read1 ...
>     SRR031714.2658602 chr2R - 13M378N24M 37 6983858 6984272 ## 1st
>     mate/read1 ...
>
>
> For them you mentioned:
>
>     pairs could be rec(1) <-> rec(3) and rec(2) <->
>     rec(4), or they could be rec(1) <-> rec(4) and rec(2) <-> rec(3).
>     There is no way to disambiguate!
>
>
> I agree the two pairing strategies above are ambiguous in light of
> codes/computation. But it is fact that rec(1) and rec(2) are exactly
> same,

Yes in that case. But it's not always the case. Maybe I should put an
example in this man page (i.e. the man page for findMateAlignment) where
the 2 records are not the same.

Note that in order to decide whether they are exactly the same or not, 
one would need to look at all the fixed fields in the 2 records plus
all the possible tags (there can be an arbitrary number of them).
There is a computational cost associated to this and I don't want to
impose that cost by default just to save a tiny fraction of ambiguous
pairs. Not worth it. My understanding is that it's not even possible
at the moment to load all the tags with Rsamtools (unless you already
know the tags but I'm not aware of an easy way to discover that).

> thusthese two alignments are actually one type of multiple
> alignment, though wired but potentially useful. The factor that leads to
> the multiple mappable positions of one mate of the fragment and fixed
> position of the other mate could probably:
> a> the reference sequence similarity between these
> two neighboring regions together with different splicing isoforms. Here
> first mate could have 2 mapped positions. # is exon while = is intron.
>
> 2nd mate:                            ----  ----
> 1st mate:   ----      ----
> 1st mate:     ----    ----
> isoform1: ==####======#####==========####==####===
> isoform2: ==######====#####==========####==####===
>
> b> or just aligner artifact, etc. Therefore I cannot agree your package
> dumps those alignments.

If you want to keep the ambiguous alignments, you can use
readGAlignmentsList() instead of readGAlignmentPairs().
See ?readGAlignmentsList for the details.

> It needs better function name, at least, to
> retrieve them.

The utility to retrieve them is called getDumpedAlignments(). What do
you propose?

Thanks,
H.

>
> Regards,
> Yun
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319