[BioC] The ambiguous alignment examples in findMateAlignment don't have to be dumped

Sat Jan 25 00:50:16 CET 2014

Hi Yun,

On 01/24/2014 02:01 PM, yun YAN wrote:
> Hi Hervé,
> Thanks for your tips on `readGAlignmentsList`. Unlike
> `readGAlignmentPairs`, it is not compatible with `first`, `last`,
> `left`, `right` function with feedbacks:
>
>     unable to find an inherited method for function ‘first’ for
>     signature ‘"GAlignmentsList"’
>
> I'd better speak out that the goal of my sub-modules here is to get the
> reads coverage on strand(+)/(-) separately as my RNA-seq library is
> strand-specific, therefore the small fraction of ambiguous reads are
> better not dumped. (BTW you are right the fraction is too small, 1340/13
> million alignments in my test case, to deserve computational resources).

If you want to compute coverage, you don't need to go thru the pairing
process. Just load them in a GAlignments object with readGAlignments()
and call coverage() on it. You won't get *exactly* the same coverage
vector as if you load the alignments with readGAlignmentPairs() because
the latter cannot pair everything. But it will be *very* close.

Alternatively, you can load them with readGAlignmentsList() and do
coverage(unlist(.)) on your GAlignmentsList object. That should give
you exactly the same result as when you load with readGAlignments().

>
>     bamData <- readGAlignmentsList("my_alignment.bam", ...)
>
>     signalOnPositiveStrand <- coverage(grglist(left(bamData),
>     drop.D.range=T))
>
>
> The psudo-code would make it in one-step if my understanding were right
> together with the compatibility of `readGAlignmentsList` and `left`.
> **Is there any other functions or methods that could help me to filter
> the first/second mate when processing the readGAlignmentsList object or
> dumped alignments retrieved by getDumpedAlignments?** Like this:
>
>     firstMate  <- bamData[mates(bamData) == 1, ]
>
>     secondMate <- bamData[mates(bamData) == 2, ]
>
>     firstMate  <- readsRetrived[mates(readsRetrived) == 1, ]
>
>     secondMate <- readsRetrived[mates(readsRetrived) == 2, ]
>

Like scanBam(), which they rely on, the readGAlignment* functions can
take a ScanBamParam object thru their 'param' argument. Using
a ScanBamParam object lets you filter the alignments based on strand,
1st/2nd mate, etc... See ?ScanBamParam for more info.

>
> I do have other three strategies:
> 1> back to root, the `scanBam` function with the following parameters.
> But it is slow (as a result some guys prefer ShortRead package and
> python pysam package), and returns a list, the fundamental R class, thus
> I'd better not recreate wheels to dig out information, construct data
> table, etc.
>
>     isFirstMatePrmt <- ScanBamParam(flag = scanBamFlag(isPaired=TRUE,
>     isProperPair=TRUE, isFirstMateRead=TRUE), what=scanBamWhat())

So you know about ScanBamParam() and its 'flag' arg. Now use that with
any of the readGAlignment* functions. Note that, when using these
function, some flags are implicit e.g. readGAlignmentPairs()
automatically sets the isPaired flag to TRUE for you.

Cheers,
H.

>
> 2> feed R with the first/second mate bam by using samtools. But it would
> cost extra disk storage and not strongly dependent on R.
> 3> betray to HTSeq, a python package, the worst choice.
>
> Cheers,
>
> Yun
>
> On Thu, Jan 23, 2014 at 7:25 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi Yun,
>
>
>     On 01/23/2014 12:56 PM, yun YAN wrote:
>
>         Hi Hervé Pagès and other maintainers in Bioconductor,
>
>         Thanks for your package instruction
>         (http://140.107.3.20/packages/__2.14/bioc/manuals/__GenomicAlignments/man/__GenomicAlignments.pdf
>         <http://140.107.3.20/packages/2.14/bioc/manuals/GenomicAlignments/man/GenomicAlignments.pdf>)
>         so that I could understand the how does it work. Here you
>         mentioned the
>         dumped alignments but I was wondering whether they should be dumped?
>
>              seqnames strand cigar qwidth start end
>              <Rle> <Rle> <character> <integer> <integer> <integer>
>              SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd
>              mate/read2 ...
>              SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd
>              mate/read2 ...
>              SRR031714.2658602 chr2R - 13M372N24M 37 6983858 6984266 ## 1st
>              mate/read1 ...
>              SRR031714.2658602 chr2R - 13M378N24M 37 6983858 6984272 ## 1st
>              mate/read1 ...
>
>
>         For them you mentioned:
>
>              pairs could be rec(1) <-> rec(3) and rec(2) <->
>              rec(4), or they could be rec(1) <-> rec(4) and rec(2) <->
>         rec(3).
>              There is no way to disambiguate!
>
>
>         I agree the two pairing strategies above are ambiguous in light of
>         codes/computation. But it is fact that rec(1) and rec(2) are exactly
>         same,
>
>
>     Yes in that case. But it's not always the case. Maybe I should put an
>     example in this man page (i.e. the man page for findMateAlignment) where
>     the 2 records are not the same.
>
>     Note that in order to decide whether they are exactly the same or
>     not, one would need to look at all the fixed fields in the 2 records
>     plus
>     all the possible tags (there can be an arbitrary number of them).
>     There is a computational cost associated to this and I don't want to
>     impose that cost by default just to save a tiny fraction of ambiguous
>     pairs. Not worth it. My understanding is that it's not even possible
>     at the moment to load all the tags with Rsamtools (unless you already
>     know the tags but I'm not aware of an easy way to discover that).
>
>         thusthese two alignments are actually one type of multiple
>
>         alignment, though wired but potentially useful. The factor that
>         leads to
>         the multiple mappable positions of one mate of the fragment and
>         fixed
>         position of the other mate could probably:
>         a> the reference sequence similarity between these
>         two neighboring regions together with different splicing
>         isoforms. Here
>         first mate could have 2 mapped positions. # is exon while = is
>         intron.
>
>         2nd mate:                            ----  ----
>         1st mate:   ----      ----
>         1st mate:     ----    ----
>         isoform1: ==####======#####==========###__#==####===
>         isoform2: ==######====#####==========###__#==####===
>
>         b> or just aligner artifact, etc. Therefore I cannot agree your
>         package
>         dumps those alignments.
>
>
>     If you want to keep the ambiguous alignments, you can use
>     readGAlignmentsList() instead of readGAlignmentPairs().
>     See ?readGAlignmentsList for the details.
>
>
>         It needs better function name, at least, to
>         retrieve them.
>
>
>     The utility to retrieve them is called getDumpedAlignments(). What do
>     you propose?
>
>     Thanks,
>     H.
>
>
>         Regards,
>         Yun
>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
> --
> YAN Yun, Master of Science
> RNA Lab, 6101
> Molecular Biology and Biochemistry Department
> College of Life Sciences, Wuhan University
> Wuhan, Hubei, P.R.China

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319