[BioC] summarizeOverlaps question

Wed Aug 29 23:37:02 CEST 2012

Hi Jim,

I think what you're seeing is due to how the annotation is provided. In 
the first example there are many transcripts the reads can hit and in 
the second, there is only one. When a GRangesList is given as 'subject' 
each outer list element is considered the feature. So in the first 
example we have 55419 features,

 > library(TxDb.Mmusculus.UCSC.mm9.knownGene)
 > ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE)
 > class(ex)
[1] "GRangesList"
attr(,"package")
[1] "GenomicRanges"
 > length(ex)
[1] 55419

In the second we have one,
 > ex2 <- ex["uc007amp.2"]
 > class(ex2)
[1] "GRangesList"
attr(,"package")
[1] "GenomicRanges"
 > length(ex2)
[1] 1

summarizeOverlaps() will either discard reads that hit >1 feature 
('Union') or it will make a decision about how to assign the read to a 
single feature if possible ('IntersectionStrict' or 
'IntersectionNotEmpty'). I believe in the first case many of your reads 
hit > 1 transcript and cannot be resolved using 'IntersectionNotEmpty'. 
When a read can't be resolved it is dropped. In the second case no reads 
should be dropped due to multi-hits because there is only 1 possible 
feature to hit.

Let me know if this does not clear things up and maybe I can take a 
closer look at your data.

Valerie

On 08/29/2012 09:39 AM, James W. MacDonald wrote:
> I am getting results from summarizeOverlaps(<GrangesList>, 
> <BamFileList>) that I don't understand. Here is the way I normally use 
> the function:
>
> library(Rsamtools)
> library(TxDb.Mmusculus.UCSC.mm9.knownGene)
> ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE)
> samps_casava <- BamFileList(<path to bamfiles>)
> olaps <- summarizeOverlaps(ex, samps_casava, mode = 
> "IntersectionNotEmpty")
>
> and then as an example, check the counts for uc007amp.2
>
> > assays(olaps)$counts["uc007amp.2",, drop=F]
>            GG1_ATCACG_L002.uniques.sorted.bam
> uc007amp.2                                  1
>            GG2_TTAGGC_L002.uniques.sorted.bam
> uc007amp.2                                  1
>            GG3_ACTTGA_L002.uniques.sorted.bam
> uc007amp.2                                  0
>            GG4_GATCAG_L002.uniques.sorted.bam
> uc007amp.2                                  0
>            GG5_TAGCTT_L003.uniques.sorted.bam
> uc007amp.2                                  0
>            GG6_GGCTAC_L003.uniques.sorted.bam
> uc007amp.2                                  1
>            GG7_GTGGCC_L003.uniques.sorted.bam
> uc007amp.2                                  0
>            GG8_GTTTCG_L003.uniques.sorted.bam
> uc007amp.2                                  0
>            GG9_CGTACG_L004.uniques.sorted.bam
> uc007amp.2                                  0
>            GG10_GAGTGG_L004.uniques.sorted.bam
> uc007amp.2                                   0
>            GG11_ACTGAT_L004.uniques.sorted.bam
> uc007amp.2                                   1
>            GG12_ATTCCT_L004.uniques.sorted.bam
> uc007amp.2                                   3
>
> However, this doesn't make any sense to me, as there are actually tons 
> of reads that overlap this transcript (I know this from looking at the 
> transcript and these data using IGV). So let's just summarize the 
> overlaps for that transcript alone.
>
> > ex2 <- ex["uc007amp.2"]
> > olaps2 <- summarizeOverlaps(ex2, samps_casava, mode = 
> "IntersectionNotEmpty")
> > assays(olaps2)$counts
>            GG1_ATCACG_L002.uniques.sorted.bam
> uc007amp.2                                 45
>            GG2_TTAGGC_L002.uniques.sorted.bam
> uc007amp.2                                 90
>            GG3_ACTTGA_L002.uniques.sorted.bam
> uc007amp.2                                 70
>            GG4_GATCAG_L002.uniques.sorted.bam
> uc007amp.2                                 64
>            GG5_TAGCTT_L003.uniques.sorted.bam
> uc007amp.2                                111
>            GG6_GGCTAC_L003.uniques.sorted.bam
> uc007amp.2                                 26
>            GG7_GTGGCC_L003.uniques.sorted.bam
> uc007amp.2                                 34
>            GG8_GTTTCG_L003.uniques.sorted.bam
> uc007amp.2                                 36
>            GG9_CGTACG_L004.uniques.sorted.bam
> uc007amp.2                                 98
>            GG10_GAGTGG_L004.uniques.sorted.bam
> uc007amp.2                                 153
>            GG11_ACTGAT_L004.uniques.sorted.bam
> uc007amp.2                                 164
>            GG12_ATTCCT_L004.uniques.sorted.bam
> uc007amp.2                                 174
>
>
> Any ideas where I am going wrong?
>
> Best,
>
> Jim
>
>