[BioC] readGAlignmentPairsFromBam and readGAlignmentFromBam differences

Mon Feb 10 16:19:19 CET 2014

Hello,

I am doing an RNA-seq analysis for paired-end reads and I have some questions for readGAlignmentPairsFromBam. I have mapped the reads with tophatv2 and aligned them with Genomic Features using the readGAlignmentPairsFromBam function, see the code below. 
I understand that since I have paired-end reads I should use readGAlignmentPairsFromBam instead of readGAlignmentFromBam but it seems to have much more strict criteria for the pairing and therefore, the counts are much lower (mean number of counts: 185 versus 440 for a random sample). 

I read the pairing criteria trying to understand why lots of the reads are not being counted but I get differences for which the reason of exclusion is not clear to me. 

example for one of these reads:
$ samtools view 209_accepted_hits.bam  | grep  "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259"
HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259	97	chr15	77414942	50	51M	=	77415180	289	GTCTGCAGAGTCTCCACTCTGCCGGAAGTCGGATCCCCTTTTAGACTCCAG	&&&(((((**(**++++++++++++++++++++)+++++++++++++++++	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:51	YT:Z:UU	XS:A:-	NH:i:1
HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259	145	chr15	77415180	50	51M	=	77414942	-289	TGCATTTTGATCCAGTCACCGTTTCTTTAGGCTGCTCTGCCCTTAACCCAG	*+(#++)+)+++**($)(++++*)(+++*)'++++*)+**(((&&%%&$$$	AS:i:-3	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:20A30	YT:Z:UU	XS:A:-	NH:i:1

is included in the result from readGAlignmentFromBam but not from readGAlignmentPairsFromBam. Also, many pairs with bitwise flags of 97 and 145 (for pair) are not included in readGAlignmentPairsFromBam although their flags satisfy the pairing criteria. 

Also I get some reads in readGAlignmentPairsFromBam that I donâ€™t get from readGAlignmentFromBam, which again I donâ€™t understand why. 

Example for one of these reads:
$samtools view 209_accepted_hits.bam.fltrd.bam | grep "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830"
HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830	161	chr15	77426477	50	51M	=	77426641	215	TGTGGTTCCAGTCACCAGCTGCACATCTGAGACACACCAGACACAAGCTCT	%$%&('&(**)(*+++)++++++++++++++++++++++++++++++++++	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:51	YT:Z:UU	XS:A:-	NH:i:1
HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830	81	chr15	77426641	50	51M	=	77426477	-215	GCCATAGCTGTCCATGATCTCTGATTACCTTTCTTCTATGATTTAAAAGGG	++++++++++++++++++++++++++*+++++++++++*****(((((&&&	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:51	YT:Z:UU	NH:i:1

Can someone, please, explain why this is?  
Thank you in advance for your time.
Emmanouela

 -- output of sessionInfo(): 

library(GenomicRanges)
library(Rsamtools)
library(GenomicFeatures)
library(rtracklayer)

txdb.ensgene <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "refGene")
range.exon <- exonsBy(txdb.ensgene, "gene")

my_bam <- "209_accepted_hits.bam.fltrd.bam"
aligns <- readGAlignmentPairsFromBam(my_bam,use.names=T)
aligns_s <- readGAlignmentsFromBam(my_bam,use.names=T)

temp3 <- findOverlaps(range.exon["328561"],aligns) # gene Apol10b
temp3_aligns <- aligns[subjectHits(temp3)]  

temp4 <- findOverlaps(range.exon["328561"],aligns_s) 
temp4_aligns <- aligns_s[subjectHits(temp4)]  

setdiff(names(temp3_aligns) , names(temp4_aligns))[1]
[1] "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830" 

setdiff(names(temp4_aligns),names(temp3_aligns))[1]
[1] "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259" 

sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.ISO-8859-1       LC_NUMERIC=C                   
 [3] LC_TIME=en_GB.ISO-8859-1        LC_COLLATE=en_GB.ISO-8859-1    
 [5] LC_MONETARY=en_GB.ISO-8859-1    LC_MESSAGES=en_GB.ISO-8859-1   
 [7] LC_PAPER=C                      LC_NAME=C                      
 [9] LC_ADDRESS=C                    LC_TELEPHONE=C                 
[11] LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C            

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] org.Mm.eg.db_2.10.1    RSQLite_0.11.4         DBI_0.2-7             
 [4] annotate_1.40.0        rtracklayer_1.22.1     GenomicFeatures_1.14.2
 [7] AnnotationDbi_1.24.0   Biobase_2.22.0         Rsamtools_1.14.2      
[10] Biostrings_2.30.1      GenomicRanges_1.14.4   XVector_0.2.0         
[13] IRanges_1.20.6         BiocGenerics_0.8.0    

loaded via a namespace (and not attached):
[1] biomaRt_2.18.0  bitops_1.0-6    BSgenome_1.30.0 RCurl_1.95-4.1 
[5] stats4_3.0.1    tools_3.0.1     XML_3.98-1.1    xtable_1.7-1   
[9] zlibbioc_1.8.0 

--
Sent via the guest posting facility at bioconductor.org.