[BioC] summarizeOverlaps mode ignoring inter feature overlaps

Wed Apr 10 18:59:49 CEST 2013

Hi Thomas,

On 04/10/2013 09:23 AM, Thomas Girke wrote:
> Valerie,
>
> Please see my inserted comments below.
>
> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote:
>> Ah, I see. You'd like to count with one of the existing modes but have
>> the option to pick up counts for these inter-feature reads (fall
>> completely 'within' >1 feature). These inter-feature reads would be
>> double (triple, quadruple, etc.) counted. Essentially they would add one
>> count to each feature they hit. Right?
>
> Correct. Perhaps let's discuss this with a very common example of
> transcript-level counting rather than counting on the unified (virtual)
> exonic regions of genes. With the current description provided in the
> summarizeOverlaps vignette at
> http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf
> I don't see how this can be achieved without falling back to using
> countOverlaps without loosing the new counting modes provided by
> summarizeOverlaps?

It can't be achieved with the function as is but we could add an 
argument to handle this (as you suggested from the start). If 
'inter-feature=TRUE' then these counts would be added to the counts 
already obtained from using a particular 'mode'. I will move ahead with 
implementing this argument.

>
>>
>> One more thought about memory usage. If you are working with single-end
>> reads the summarizeOverlaps,BamFileList-method I mentioned below should
>> work fine. The approach is slightly different with paired-end reads. Our
>> current algorithm for pairing paired-end reads requires the whole file
>> to be read into memory. A different approach is currently being
>> developed but in the meantime you can take the qname-sorted approach.
>> The Bam file will need to be sorted by qname and both the 'yieldSize'
>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An
>> example is on ?BamFile.
>
> Thanks for pointing this out. My fault that I didn't read through all
> the linked documentation. Perhaps it may not be a bad idea to make the
> memory restricted bam read instances the default setting in the future.
> This will definitely help biologists using those utilities without
> crashing their machines on the first attempt.

Good suggestion, thanks.

Valerie

>
> Thomas
>
>
>>
>> Valerie
>>
>>
>> On 04/09/2013 08:01 PM, Thomas Girke wrote:
>>> Thanks for the tip. I guess doing it this way reverses the counting mode
>>> back to countOverlaps, but how can I use at the same time
>>> "IntersectionStrict" or any of the other modes provided by
>>> summarizeOverlaps if its mode argument is already used and countOverlaps
>>> doesn't accept one?
>>>
>>> Thomas
>>>
>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote:
>>>> Thanks for the example. You're right, none of the modes will count a
>>>> read falling completely within >1 feature.
>>>>
>>>> You can supply your own function as a 'mode'. Two requirements must be met:
>>>>
>>>> (1) The function must have 3 arguments that correspond to
>>>>        'features', 'reads' and 'ignore.strand', in that order.
>>>>
>>>> (2) The function must return a vector of counts the same length
>>>>        as 'features'
>>>>
>>>> Here is an example using countOverlaps() which I think gets at the
>>>> counting you want.
>>>>
>>>> counter <- function(x, y,  ignore.strand)
>>>>        countOverlaps(y, x, ignore.strand=ignore.strand)
>>>>
>>>>    > assays(summarizeOverlaps(gr, rd, mode=counter))$counts
>>>>         [,1]
>>>> [1,]    1
>>>> [2,]    1
>>>>
>>>>
>>>> Valerie
>>>>
>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote:
>>>>> Hi Valerie,
>>>>>
>>>>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option
>>>>> that we often need for cases like this:
>>>>>
>>>>> Read1    ----
>>>>> F1 ----------------
>>>>> F2   -----------
>>>>>
>>>>> where we would like to have at least the option to assign Read1 to both F1 and F2:
>>>>>
>>>>> F1: 1
>>>>> F2: 1
>>>>>
>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently
>>>>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap
>>>>> option is frequently a problem when working with poorly annotated genomes (high
>>>>> uncertainty about hidden/incorrect feature overlaps) or various
>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read
>>>>> assignments than not counting at all as shown above.
>>>>>
>>>>> ## Here is a code example illustrating the same case:
>>>>> library(GenomicRanges); library(Rsamtools)
>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)),
>>>>>          pos = as.integer(c(500)),
>>>>>          cigar = rep("100M", 1), strand = strand(rep("+", 1)))
>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1")
>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2")
>>>>> gr <- c(gr1, gr2)
>>>>>
>>>>> ## All of the three current modes in summarizeOverlaps return a count of zero
>>>>> ## because of its inter_feature_overlap awareness:
>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts
>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts
>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
>>>>>         [,1]
>>>>> [1,]    0
>>>>> [2,]    0
>>>>>
>>>>> ## However, countOverlaps handles this correctly, but is not the best choice when
>>>>> ## counting multiple range features.
>>>>> countOverlaps(gr, rd)
>>>>> [1] 1 1
>>>>>
>>>>>
>>>>> Thomas
>>>>>
>>>>>> sessionInfo()
>>>>> R version 3.0.0 (2013-04-03)
>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>>
>>>>> attached base packages:
>>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>> [1] Rsamtools_1.12.0     Biostrings_2.28.0    GenomicRanges_1.12.1
>>>>> [4] IRanges_1.18.0       BiocGenerics_0.6.0
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] bitops_1.0-5   stats4_3.0.0   tools_3.0.0    zlibbioc_1.6.0
>>>>>
>>>>>
>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote:
>>>>>> Hi Thomas,
>>>>>>
>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote:
>>>>>>> Dear Valerie,
>>>>>>>
>>>>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap
>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently,
>>>>>>> one can switch back to countOverlaps when feature overlap unawareness is
>>>>>>> the more appropriate counting mode for a biological question, but then
>>>>>>> double counting of reads mapping to multiple-range features is not
>>>>>>> accounted for. It would be really nice to have such a feature-overlap
>>>>>>> unaware option directly in summarizeOverlaps.
>>>>>>
>>>>>> No, we don't currently have an option to ignore feature-overlap. It
>>>>>> sounds like you want to count with countOverlaps() but still want the
>>>>>> double counting to be resolved. This is essentially what the other modes
>>>>>> are doing so I must be missing something.
>>>>>>
>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. With
>>>>>> something like ignorefeature0L=TRUE, what results would you expect to
>>>>>> see? Maybe you have another, more descriptive example?
>>>>>>
>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
>>>>>>                         names=c("A", "B")))
>>>>>>
>>>>>>     > countOverlaps(features, reads)
>>>>>> [1] 2 1
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Another question relates to the memory usage of summarizeOverlaps. Has
>>>>>>> this been optimized yet? On a typical bam file with ~50-100 million
>>>>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To
>>>>>>> use the function on a desktop computer or in large-scale RNA-Seq
>>>>>>> projects on a commodity compute cluster, it would be desirable if every
>>>>>>> counting instance would consume not more than 5GB of RAM.
>>>>>>
>>>>>> Have you tried the BamFileList-method? There is an example at the bottom
>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan
>>>>>> mentioned, the key is to set the 'yieldSize' parameter when creating the
>>>>>> BamFile. This method also makes use of mclapply().
>>>>>>
>>>>>> Valerie
>>>>>>
>>>>>>>
>>>>>>> Thanks in advance for your help and suggestions,
>>>>>>>
>>>>>>> Thomas
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>