[BioC] summarizeOverlaps mode ignoring inter feature overlaps

Wed Apr 10 21:42:09 CEST 2013

Thanks. Adding an inter-feature unaware mode will be very helpful and also 
broaden summarizeOverlaps' application spectrum for a lot of use cases.

Thomas

On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote:
> Hi Thomas,
> 
> On 04/10/2013 09:23 AM, Thomas Girke wrote:
> > Valerie,
> >
> > Please see my inserted comments below.
> >
> > On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote:
> >> Ah, I see. You'd like to count with one of the existing modes but have
> >> the option to pick up counts for these inter-feature reads (fall
> >> completely 'within' >1 feature). These inter-feature reads would be
> >> double (triple, quadruple, etc.) counted. Essentially they would add one
> >> count to each feature they hit. Right?
> >
> > Correct. Perhaps let's discuss this with a very common example of
> > transcript-level counting rather than counting on the unified (virtual)
> > exonic regions of genes. With the current description provided in the
> > summarizeOverlaps vignette at
> > http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf
> > I don't see how this can be achieved without falling back to using
> > countOverlaps without loosing the new counting modes provided by
> > summarizeOverlaps?
> 
> It can't be achieved with the function as is but we could add an 
> argument to handle this (as you suggested from the start). If 
> 'inter-feature=TRUE' then these counts would be added to the counts 
> already obtained from using a particular 'mode'. I will move ahead with 
> implementing this argument.
> 
> >
> >>
> >> One more thought about memory usage. If you are working with single-end
> >> reads the summarizeOverlaps,BamFileList-method I mentioned below should
> >> work fine. The approach is slightly different with paired-end reads. Our
> >> current algorithm for pairing paired-end reads requires the whole file
> >> to be read into memory. A different approach is currently being
> >> developed but in the meantime you can take the qname-sorted approach.
> >> The Bam file will need to be sorted by qname and both the 'yieldSize'
> >> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An
> >> example is on ?BamFile.
> >
> > Thanks for pointing this out. My fault that I didn't read through all
> > the linked documentation. Perhaps it may not be a bad idea to make the
> > memory restricted bam read instances the default setting in the future.
> > This will definitely help biologists using those utilities without
> > crashing their machines on the first attempt.
> 
> Good suggestion, thanks.
> 
> Valerie
> 
> >
> > Thomas
> >
> >
> >>
> >> Valerie
> >>
> >>
> >> On 04/09/2013 08:01 PM, Thomas Girke wrote:
> >>> Thanks for the tip. I guess doing it this way reverses the counting mode
> >>> back to countOverlaps, but how can I use at the same time
> >>> "IntersectionStrict" or any of the other modes provided by
> >>> summarizeOverlaps if its mode argument is already used and countOverlaps
> >>> doesn't accept one?
> >>>
> >>> Thomas
> >>>
> >>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote:
> >>>> Thanks for the example. You're right, none of the modes will count a
> >>>> read falling completely within >1 feature.
> >>>>
> >>>> You can supply your own function as a 'mode'. Two requirements must be met:
> >>>>
> >>>> (1) The function must have 3 arguments that correspond to
> >>>>        'features', 'reads' and 'ignore.strand', in that order.
> >>>>
> >>>> (2) The function must return a vector of counts the same length
> >>>>        as 'features'
> >>>>
> >>>> Here is an example using countOverlaps() which I think gets at the
> >>>> counting you want.
> >>>>
> >>>> counter <- function(x, y,  ignore.strand)
> >>>>        countOverlaps(y, x, ignore.strand=ignore.strand)
> >>>>
> >>>>    > assays(summarizeOverlaps(gr, rd, mode=counter))$counts
> >>>>         [,1]
> >>>> [1,]    1
> >>>> [2,]    1
> >>>>
> >>>>
> >>>> Valerie
> >>>>
> >>>> On 04/09/2013 09:37 AM, Thomas Girke wrote:
> >>>>> Hi Valerie,
> >>>>>
> >>>>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option
> >>>>> that we often need for cases like this:
> >>>>>
> >>>>> Read1    ----
> >>>>> F1 ----------------
> >>>>> F2   -----------
> >>>>>
> >>>>> where we would like to have at least the option to assign Read1 to both F1 and F2:
> >>>>>
> >>>>> F1: 1
> >>>>> F2: 1
> >>>>>
> >>>>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently
> >>>>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap
> >>>>> option is frequently a problem when working with poorly annotated genomes (high
> >>>>> uncertainty about hidden/incorrect feature overlaps) or various
> >>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read
> >>>>> assignments than not counting at all as shown above.
> >>>>>
> >>>>> ## Here is a code example illustrating the same case:
> >>>>> library(GenomicRanges); library(Rsamtools)
> >>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)),
> >>>>>          pos = as.integer(c(500)),
> >>>>>          cigar = rep("100M", 1), strand = strand(rep("+", 1)))
> >>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1")
> >>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2")
> >>>>> gr <- c(gr1, gr2)
> >>>>>
> >>>>> ## All of the three current modes in summarizeOverlaps return a count of zero
> >>>>> ## because of its inter_feature_overlap awareness:
> >>>>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts
> >>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts
> >>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
> >>>>>         [,1]
> >>>>> [1,]    0
> >>>>> [2,]    0
> >>>>>
> >>>>> ## However, countOverlaps handles this correctly, but is not the best choice when
> >>>>> ## counting multiple range features.
> >>>>> countOverlaps(gr, rd)
> >>>>> [1] 1 1
> >>>>>
> >>>>>
> >>>>> Thomas
> >>>>>
> >>>>>> sessionInfo()
> >>>>> R version 3.0.0 (2013-04-03)
> >>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
> >>>>>
> >>>>> locale:
> >>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> >>>>>
> >>>>> attached base packages:
> >>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> >>>>> [8] base
> >>>>>
> >>>>> other attached packages:
> >>>>> [1] Rsamtools_1.12.0     Biostrings_2.28.0    GenomicRanges_1.12.1
> >>>>> [4] IRanges_1.18.0       BiocGenerics_0.6.0
> >>>>>
> >>>>> loaded via a namespace (and not attached):
> >>>>> [1] bitops_1.0-5   stats4_3.0.0   tools_3.0.0    zlibbioc_1.6.0
> >>>>>
> >>>>>
> >>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote:
> >>>>>> Hi Thomas,
> >>>>>>
> >>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote:
> >>>>>>> Dear Valerie,
> >>>>>>>
> >>>>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap
> >>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently,
> >>>>>>> one can switch back to countOverlaps when feature overlap unawareness is
> >>>>>>> the more appropriate counting mode for a biological question, but then
> >>>>>>> double counting of reads mapping to multiple-range features is not
> >>>>>>> accounted for. It would be really nice to have such a feature-overlap
> >>>>>>> unaware option directly in summarizeOverlaps.
> >>>>>>
> >>>>>> No, we don't currently have an option to ignore feature-overlap. It
> >>>>>> sounds like you want to count with countOverlaps() but still want the
> >>>>>> double counting to be resolved. This is essentially what the other modes
> >>>>>> are doing so I must be missing something.
> >>>>>>
> >>>>>> In this example 2 reads hit feature A, 1 read hits feature B. With
> >>>>>> something like ignorefeature0L=TRUE, what results would you expect to
> >>>>>> see? Maybe you have another, more descriptive example?
> >>>>>>
> >>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
> >>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
> >>>>>>                         names=c("A", "B")))
> >>>>>>
> >>>>>>     > countOverlaps(features, reads)
> >>>>>> [1] 2 1
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Another question relates to the memory usage of summarizeOverlaps. Has
> >>>>>>> this been optimized yet? On a typical bam file with ~50-100 million
> >>>>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To
> >>>>>>> use the function on a desktop computer or in large-scale RNA-Seq
> >>>>>>> projects on a commodity compute cluster, it would be desirable if every
> >>>>>>> counting instance would consume not more than 5GB of RAM.
> >>>>>>
> >>>>>> Have you tried the BamFileList-method? There is an example at the bottom
> >>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan
> >>>>>> mentioned, the key is to set the 'yieldSize' parameter when creating the
> >>>>>> BamFile. This method also makes use of mclapply().
> >>>>>>
> >>>>>> Valerie
> >>>>>>
> >>>>>>>
> >>>>>>> Thanks in advance for your help and suggestions,
> >>>>>>>
> >>>>>>> Thomas
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> Bioconductor mailing list
> >>>>>>> Bioconductor at r-project.org
> >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>>>>>