[BioC] [devteam-bioc] readGAlignmentPairsFromBam and readGAlignmentFromBam differences

Valerie Obenchain vobencha at fhcrc.org
Mon Feb 10 21:42:24 CET 2014


Hi,

On 02/10/2014 07:19 AM, Maintainer wrote:
>
> Hello,
>
> I am doing an RNA-seq analysis for paired-end reads and I have some questions for readGAlignmentPairsFromBam. I have mapped the reads with tophatv2 and aligned them with Genomic Features using the readGAlignmentPairsFromBam function,

Just to clarify, readGAlignmentsPairsFromBam() does not align reads. It 
reads in records that have been previously aligned.

see the code below.
> I understand that since I have paired-end reads I should use readGAlignmentPairsFromBam instead of readGAlignmentFromBam

There is a family of functions available for reading in single and 
paired-end data. The functions with the *Bam extension are lower level 
and will probably go away (deprecated) in the next release cycle to 
eliminate redundancy. The functions below are the higher level counter 
parts that call the *Bam functions; these are the functions we recommend 
you use.

Single-end:
readGAlignments()

Paired-end"
readGAlignmentPairs()
readGAlignmentsList()

If you have paired-end data you can use either readGAlignmentPairs() or 
readGAlignmentsList(). These functions both use a mate-pairing algorithm 
but return different classes. In the devel branch the two call the same 
algorithm, in release they are different.

readGAlignmentPairs() returns a GAlignmentPairs class. This class can 
only hold pairs so any unmapped reads, singletons etc. are dumped. The 
dumped reads can be retrieved with getDumpedAlignments().

readGAlignmentsList() returns a List of GAlignments. The class can hold 
both pairs and records that cannot be paired. See ?readGAlignmentsList 
for details.


but it seems to have much more strict criteria for the pairing and 
therefore, the counts are much lower (mean number of counts: 185 versus 
440 for a random sample).
>
> I read the pairing criteria trying to understand why lots of the reads are not being counted but I get differences for which the reason of exclusion is not clear to me.
>
> example for one of these reads:
> $ samtools view 209_accepted_hits.bam  | grep  "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259"
> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259	97	chr15	77414942	50	51M	=	77415180	289	GTCTGCAGAGTCTCCACTCTGCCGGAAGTCGGATCCCCTTTTAGACTCCAG	&&&(((((**(**++++++++++++++++++++)+++++++++++++++++	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:51	YT:Z:UU	XS:A:-	NH:i:1
> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259	145	chr15	77415180	50	51M	=	77414942	-289	TGCATTTTGATCCAGTCACCGTTTCTTTAGGCTGCTCTGCCCTTAACCCAG	*+(#++)+)+++**($)(++++*)(+++*)'++++*)+**(((&&%%&$$$	AS:i:-3	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:20A30	YT:Z:UU	XS:A:-	NH:i:1
>
> is included in the result from readGAlignmentFromBam but not from readGAlignmentPairsFromBam.

readGAlignmentsFromBam() returns both records because it treats all 
reads as single-end; there is no mate-paring here. The output of the two 
functions is not expected to be the same because one assumes single-end 
reads, the other assumes paired-end.

Are you saying that one of these reads above is not returned by 
readGAlignmentPairs()? If they are pairs (which it looks like they are) 
either both or neither should come back from readGAlignmentPairs(). 
Remember the class has right and left pairs that can be accessed with 
right() and left(). If neither are returned then they did not meet the 
mate-pairing criteria. As an fyi, you can investigate bam flags further 
with the bamFlagAsBitMatrix() function.

> Also, many pairs with bitwise flags of 97 and 145 (for pair) are not included in readGAlignmentPairsFromBam although their flags satisfy the pairing criteria.
>
> Also I get some reads in readGAlignmentPairsFromBam that I don’t get from readGAlignmentFromBam, which again I don’t understand why.

readGAlignmentPairsFromBam() returns records that satisfy a mate-pairing 
algorithm. readGAlignmentFromBam() treats all records as single-end; 
there is no mate-pairing criteria in this function so all records are 
returned (sans those you filter out with ScanBamParam).

>
> Example for one of these reads:
> $samtools view 209_accepted_hits.bam.fltrd.bam | grep "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830"
> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830	161	chr15	77426477	50	51M	=	77426641	215	TGTGGTTCCAGTCACCAGCTGCACATCTGAGACACACCAGACACAAGCTCT	%$%&('&(**)(*+++)++++++++++++++++++++++++++++++++++	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:51	YT:Z:UU	XS:A:-	NH:i:1
> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830	81	chr15	77426641	50	51M	=	77426477	-215	GCCATAGCTGTCCATGATCTCTGATTACCTTTCTTCTATGATTTAAAAGGG	++++++++++++++++++++++++++*+++++++++++*****(((((&&&	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:51	YT:Z:UU	NH:i:1
>

Can you try reading the data in with readGAlignmentsList()? This class 
will hold pairs and records that did not meet the pairing criteria.

Using GenomicRanges 1.14.4 in release:

library(GenomicRanges)
library(Rsamtools)
library(pasillaBamSubset)
fl <- untreated3_chr4()
galist <- readGAlignmentsList(fl)

The metadata column 'mates' indicates if they passed the mate criteria.
>> head(galist, 2)
> GAlignmentsList of length 2:
> $1
> GAlignments with 2 alignments and 1 metadata column:
>       seqnames strand cigar qwidth start end width ngap | mates
>   [1]     chr4      +   37M     37   169 205    37    0 |  TRUE
>   [2]     chr4      -   37M     37   326 362    37    0 |  TRUE
>
> $2
> GAlignments with 2 alignments and 1 metadata column:
>       seqnames strand cigar qwidth start  end width ngap | mates
>   [1]     chr4      +   37M     37   946  982    37    0 |  TRUE
>   [2]     chr4      -   37M     37   986 1022    37    0 |  TRUE
>

Value is FALSE for records that did not pass.

>> tail(galist, 2)
> GAlignmentsList of length 2:
> $95788
> GAlignments with 2 alignments and 1 metadata column:
>       seqnames strand cigar qwidth start   end width ngap | mates
>   [1]     chr4      +   37M     37 87190 87226    37    0 | FALSE
>   [2]     chr4      -   37M     37 87476 87512    37    0 | FALSE
>
> $95789
> GAlignments with 2 alignments and 1 metadata column:
>       seqnames strand cigar qwidth start   end width ngap | mates
>   [1]     chr4      +   37M     37 12491 12527    37    0 | FALSE
>   [2]     chr4      -   37M     37 12667 12703    37    0 | FALSE

If these pairs appear in the output from readGAlignmentsList() with 
mates=FALSE then they did not meet some aspect of the criteria. In this 
case the records can also be found with getDumpedAlignments() after 
runing readGAlignmentPairs().

Let me know how it goes.

Valerie


> Can someone, please, explain why this is?
> Thank you in advance for your time.
> Emmanouela
>
>
>   -- output of sessionInfo():
>
>
>
> library(GenomicRanges)
> library(Rsamtools)
> library(GenomicFeatures)
> library(rtracklayer)
>
> txdb.ensgene <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "refGene")
> range.exon <- exonsBy(txdb.ensgene, "gene")
>
> my_bam <- "209_accepted_hits.bam.fltrd.bam"
> aligns <- readGAlignmentPairsFromBam(my_bam,use.names=T)
> aligns_s <- readGAlignmentsFromBam(my_bam,use.names=T)
>
> temp3 <- findOverlaps(range.exon["328561"],aligns) # gene Apol10b
> temp3_aligns <- aligns[subjectHits(temp3)]
>
> temp4 <- findOverlaps(range.exon["328561"],aligns_s)
> temp4_aligns <- aligns_s[subjectHits(temp4)]
>
> setdiff(names(temp3_aligns) , names(temp4_aligns))[1]
> [1] "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830"
>
> setdiff(names(temp4_aligns),names(temp3_aligns))[1]
> [1] "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259"
>
>
>
> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_GB.ISO-8859-1       LC_NUMERIC=C
>   [3] LC_TIME=en_GB.ISO-8859-1        LC_COLLATE=en_GB.ISO-8859-1
>   [5] LC_MONETARY=en_GB.ISO-8859-1    LC_MESSAGES=en_GB.ISO-8859-1
>   [7] LC_PAPER=C                      LC_NAME=C
>   [9] LC_ADDRESS=C                    LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>   [1] org.Mm.eg.db_2.10.1    RSQLite_0.11.4         DBI_0.2-7
>   [4] annotate_1.40.0        rtracklayer_1.22.1     GenomicFeatures_1.14.2
>   [7] AnnotationDbi_1.24.0   Biobase_2.22.0         Rsamtools_1.14.2
> [10] Biostrings_2.30.1      GenomicRanges_1.14.4   XVector_0.2.0
> [13] IRanges_1.20.6         BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.18.0  bitops_1.0-6    BSgenome_1.30.0 RCurl_1.95-4.1
> [5] stats4_3.0.1    tools_3.0.1     XML_3.98-1.1    xtable_1.7-1
> [9] zlibbioc_1.8.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
>
>
> ________________________________________________________________________
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>


-- 
Valerie Obenchain

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: vobencha at fhcrc.org
Phone:  (206) 667-3158
Fax:    (206) 667-1319



More information about the Bioconductor mailing list