[BioC] summarizeOverlaps question

Thu Aug 30 17:20:55 CEST 2012

Hi Valerie,

On 8/29/2012 5:37 PM, Valerie Obenchain wrote:
> Hi Jim,
>
> I think what you're seeing is due to how the annotation is provided. 
> In the first example there are many transcripts the reads can hit and 
> in the second, there is only one. When a GRangesList is given as 
> 'subject' each outer list element is considered the feature. So in the 
> first example we have 55419 features,
>
> > library(TxDb.Mmusculus.UCSC.mm9.knownGene)
> > ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE)
> > class(ex)
> [1] "GRangesList"
> attr(,"package")
> [1] "GenomicRanges"
> > length(ex)
> [1] 55419
>
> In the second we have one,
> > ex2 <- ex["uc007amp.2"]
> > class(ex2)
> [1] "GRangesList"
> attr(,"package")
> [1] "GenomicRanges"
> > length(ex2)
> [1] 1
>
> summarizeOverlaps() will either discard reads that hit >1 feature 
> ('Union') or it will make a decision about how to assign the read to a 
> single feature if possible ('IntersectionStrict' or 
> 'IntersectionNotEmpty'). I believe in the first case many of your 
> reads hit > 1 transcript and cannot be resolved using 
> 'IntersectionNotEmpty'. When a read can't be resolved it is dropped. 
> In the second case no reads should be dropped due to multi-hits 
> because there is only 1 possible feature to hit.
>
> Let me know if this does not clear things up and maybe I can take a 
> closer look at your data.

Ah. An unexpected result. I didn't think this through as well as I 
should have. The HTSeq diagram in the GenomicRanges package (as well as 
the original at 
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) imply 
overlapping exons from different genes, so I didn't think about the 
repercussions of overlapping exons from the *same* gene in different 
transcripts.

Thanks,

Jim

>
> Valerie
>
>
> On 08/29/2012 09:39 AM, James W. MacDonald wrote:
>> I am getting results from summarizeOverlaps(<GrangesList>, 
>> <BamFileList>) that I don't understand. Here is the way I normally 
>> use the function:
>>
>> library(Rsamtools)
>> library(TxDb.Mmusculus.UCSC.mm9.knownGene)
>> ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE)
>> samps_casava <- BamFileList(<path to bamfiles>)
>> olaps <- summarizeOverlaps(ex, samps_casava, mode = 
>> "IntersectionNotEmpty")
>>
>> and then as an example, check the counts for uc007amp.2
>>
>> > assays(olaps)$counts["uc007amp.2",, drop=F]
>>            GG1_ATCACG_L002.uniques.sorted.bam
>> uc007amp.2                                  1
>>            GG2_TTAGGC_L002.uniques.sorted.bam
>> uc007amp.2                                  1
>>            GG3_ACTTGA_L002.uniques.sorted.bam
>> uc007amp.2                                  0
>>            GG4_GATCAG_L002.uniques.sorted.bam
>> uc007amp.2                                  0
>>            GG5_TAGCTT_L003.uniques.sorted.bam
>> uc007amp.2                                  0
>>            GG6_GGCTAC_L003.uniques.sorted.bam
>> uc007amp.2                                  1
>>            GG7_GTGGCC_L003.uniques.sorted.bam
>> uc007amp.2                                  0
>>            GG8_GTTTCG_L003.uniques.sorted.bam
>> uc007amp.2                                  0
>>            GG9_CGTACG_L004.uniques.sorted.bam
>> uc007amp.2                                  0
>>            GG10_GAGTGG_L004.uniques.sorted.bam
>> uc007amp.2                                   0
>>            GG11_ACTGAT_L004.uniques.sorted.bam
>> uc007amp.2                                   1
>>            GG12_ATTCCT_L004.uniques.sorted.bam
>> uc007amp.2                                   3
>>
>> However, this doesn't make any sense to me, as there are actually 
>> tons of reads that overlap this transcript (I know this from looking 
>> at the transcript and these data using IGV). So let's just summarize 
>> the overlaps for that transcript alone.
>>
>> > ex2 <- ex["uc007amp.2"]
>> > olaps2 <- summarizeOverlaps(ex2, samps_casava, mode = 
>> "IntersectionNotEmpty")
>> > assays(olaps2)$counts
>>            GG1_ATCACG_L002.uniques.sorted.bam
>> uc007amp.2                                 45
>>            GG2_TTAGGC_L002.uniques.sorted.bam
>> uc007amp.2                                 90
>>            GG3_ACTTGA_L002.uniques.sorted.bam
>> uc007amp.2                                 70
>>            GG4_GATCAG_L002.uniques.sorted.bam
>> uc007amp.2                                 64
>>            GG5_TAGCTT_L003.uniques.sorted.bam
>> uc007amp.2                                111
>>            GG6_GGCTAC_L003.uniques.sorted.bam
>> uc007amp.2                                 26
>>            GG7_GTGGCC_L003.uniques.sorted.bam
>> uc007amp.2                                 34
>>            GG8_GTTTCG_L003.uniques.sorted.bam
>> uc007amp.2                                 36
>>            GG9_CGTACG_L004.uniques.sorted.bam
>> uc007amp.2                                 98
>>            GG10_GAGTGG_L004.uniques.sorted.bam
>> uc007amp.2                                 153
>>            GG11_ACTGAT_L004.uniques.sorted.bam
>> uc007amp.2                                 164
>>            GG12_ATTCCT_L004.uniques.sorted.bam
>> uc007amp.2                                 174
>>
>>
>> Any ideas where I am going wrong?
>>
>> Best,
>>
>> Jim
>>
>>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099