[BioC] summarizeOverlaps mode ignoring inter feature overlaps

Thu Apr 11 04:09:44 CEST 2013

Hi Martin,

Yes, inter_feature=TRUE would maintain the current counting mode(s) that
prohibits counting of reads mapping to multiple features. This is a
special case of counting that is very useful for counting exonic regions
of genes. However, one also wants to be able to turn off this behavior
by ignoring inter feature overlaps (just like ignore.strand=T/F).
Otherwise we cannot use summarizeOverlaps along with its current modes
for important operations like transcript level counting because many
transcript variants from the same gene will mask each others reads when
inter_feature=TRUE. Providing the option to output the results from both
inter_feature=TRUE and inter_feature=FALSE could be a very sensible
solution and time saver for users working with new genomes/GFFs, where
one cannot trust every nested annotation for various reasons, and
inter_feature=TRUE can quickly become a very risky counting strategy
since it tends to erase counts. Every biologist will scream in your
face if the counter tells them that their favorite gene has zero counts
just because of some overlap with some annotation error:). 

For multiple range features stored in GRangesList objects, I would
currently favor making "inter_feature=FALSE" ignore the overlaps
occurring among different list components, but not necessarily those
within a list component (e.g. exon ranges of a gene). This way one
can benefit from the current infrastructure by restricting its feature 
overlap scope to the range sets stored within individual list components 
but ignoring those among different list components. Utilities like reduce 
and other range modifier functions could handle situations where one wants 
to ignore all feature overlaps within and among list components. However, 
I am sure there could be other solutions to this.

Long story short if inter_feature=TRUE/FALSE could be used in
combination with modes=Union/IntersectionStrict/IntersectionNonEmpty
resulting in six different counting flavors, I would be happy and, I am 
sure, many other Bioc users as well.

I hope this makes sense.

Thomas

On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote:
> On 04/10/2013 12:42 PM, Thomas Girke wrote:
> > Thanks. Adding an inter-feature unaware mode will be very helpful and also
> > broaden summarizeOverlaps' application spectrum for a lot of use cases.
> 
> I'm probably being quite dense here, and am mostly an outside observer. What I 
> hear you saying is that there are currently three modes -- Union, 
> IntersectionStrict, IntersectionNonEmpty. These modes are summarized in the 
> seven rows of figure 1 of
> 
>     vignette("summarizeOverlaps", package="GenomicRanges")
> 
> Let's say there is a flag inter_feature, and when its value is TRUE then the 
> current counting schemes are obtained. These modes differ in the way counting 
> works for reads illustrated in rows 5, 6, and 7 of the figure. You'd like a 
> count scored where a '1' appears in the table below. With inter_feature=TRUE 
> reads are 'never counted more than once' ('Hits per read' is <= 1)
> 
> |----------------------+-----+-----------+-----------+---------------|
> | Mode                 | Row | Feature 1 | Feature 2 | Hits per read |
> |----------------------+-----+-----------+-----------+---------------|
> | Union                |   5 | 1         | 0         | 1             |
> |                      |   6 | 0         | 0         | 0             |
> |                      |   7 | 0         | 0         | 0             |
> | IntersectionStrict   |   5 | 1         | 0         | 1             |
> |                      |   6 | 1         | 0         | 1             |
> |                      |   7 | 0         | 0         | 0             |
> | IntersectionNotEmpty |   5 | 1         | 0         | 1             |
> |                      |   6 | 1         | 0         | 1             |
> |                      |   7 | 0         | 0         | 0             |
> |----------------------+-----+-----------+-----------+---------------|
> 
> 
> You'd like to add 3 more modes, with inter_feature=FALSE. Reads are sometimes 
> 'counted twice'
> 
> |----------------------+-----+-----------+-----------+---------------|
> | Mode                 | Row | Feature 1 | Feature 2 | Hits per read |
> |----------------------+-----+-----------+-----------+---------------|
> | Union                |   5 |         1 |         0 |             1 |
> |                      |   6 |         1 |         1 |             2 |
> |                      |   7 |         1 |         1 |             2 |
> | IntersectionStrict   |   5 |         1 |         0 |             1 |
> |                      |   6 |         1 |         0 |             1 |
> |                      |   7 |         1 |         1 |             2 |
> | IntersectionNotEmpty |   5 |         1 |         0 |             1 |
> |                      |   6 |         1 |         1 |             2 |
> |                      |   7 |         1 |         1 |             2 |
> |----------------------+-----+-----------+-----------+---------------|
> 
> 
> Martin
> 
> >
> > Thomas
> >
> >
> > On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote:
> >> Hi Thomas,
> >>
> >> On 04/10/2013 09:23 AM, Thomas Girke wrote:
> >>> Valerie,
> >>>
> >>> Please see my inserted comments below.
> >>>
> >>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote:
> >>>> Ah, I see. You'd like to count with one of the existing modes but have
> >>>> the option to pick up counts for these inter-feature reads (fall
> >>>> completely 'within' >1 feature). These inter-feature reads would be
> >>>> double (triple, quadruple, etc.) counted. Essentially they would add one
> >>>> count to each feature they hit. Right?
> >>>
> >>> Correct. Perhaps let's discuss this with a very common example of
> >>> transcript-level counting rather than counting on the unified (virtual)
> >>> exonic regions of genes. With the current description provided in the
> >>> summarizeOverlaps vignette at
> >>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf
> >>> I don't see how this can be achieved without falling back to using
> >>> countOverlaps without loosing the new counting modes provided by
> >>> summarizeOverlaps?
> >>
> >> It can't be achieved with the function as is but we could add an
> >> argument to handle this (as you suggested from the start). If
> >> 'inter-feature=TRUE' then these counts would be added to the counts
> >> already obtained from using a particular 'mode'. I will move ahead with
> >> implementing this argument.
> >>
> >>>
> >>>>
> >>>> One more thought about memory usage. If you are working with single-end
> >>>> reads the summarizeOverlaps,BamFileList-method I mentioned below should
> >>>> work fine. The approach is slightly different with paired-end reads. Our
> >>>> current algorithm for pairing paired-end reads requires the whole file
> >>>> to be read into memory. A different approach is currently being
> >>>> developed but in the meantime you can take the qname-sorted approach.
> >>>> The Bam file will need to be sorted by qname and both the 'yieldSize'
> >>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An
> >>>> example is on ?BamFile.
> >>>
> >>> Thanks for pointing this out. My fault that I didn't read through all
> >>> the linked documentation. Perhaps it may not be a bad idea to make the
> >>> memory restricted bam read instances the default setting in the future.
> >>> This will definitely help biologists using those utilities without
> >>> crashing their machines on the first attempt.
> >>
> >> Good suggestion, thanks.
> >>
> >> Valerie
> >>
> >>>
> >>> Thomas
> >>>
> >>>
> >>>>
> >>>> Valerie
> >>>>
> >>>>
> >>>> On 04/09/2013 08:01 PM, Thomas Girke wrote:
> >>>>> Thanks for the tip. I guess doing it this way reverses the counting mode
> >>>>> back to countOverlaps, but how can I use at the same time
> >>>>> "IntersectionStrict" or any of the other modes provided by
> >>>>> summarizeOverlaps if its mode argument is already used and countOverlaps
> >>>>> doesn't accept one?
> >>>>>
> >>>>> Thomas
> >>>>>
> >>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote:
> >>>>>> Thanks for the example. You're right, none of the modes will count a
> >>>>>> read falling completely within >1 feature.
> >>>>>>
> >>>>>> You can supply your own function as a 'mode'. Two requirements must be met:
> >>>>>>
> >>>>>> (1) The function must have 3 arguments that correspond to
> >>>>>>         'features', 'reads' and 'ignore.strand', in that order.
> >>>>>>
> >>>>>> (2) The function must return a vector of counts the same length
> >>>>>>         as 'features'
> >>>>>>
> >>>>>> Here is an example using countOverlaps() which I think gets at the
> >>>>>> counting you want.
> >>>>>>
> >>>>>> counter <- function(x, y,  ignore.strand)
> >>>>>>         countOverlaps(y, x, ignore.strand=ignore.strand)
> >>>>>>
> >>>>>>     > assays(summarizeOverlaps(gr, rd, mode=counter))$counts
> >>>>>>          [,1]
> >>>>>> [1,]    1
> >>>>>> [2,]    1
> >>>>>>
> >>>>>>
> >>>>>> Valerie
> >>>>>>
> >>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote:
> >>>>>>> Hi Valerie,
> >>>>>>>
> >>>>>>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option
> >>>>>>> that we often need for cases like this:
> >>>>>>>
> >>>>>>> Read1    ----
> >>>>>>> F1 ----------------
> >>>>>>> F2   -----------
> >>>>>>>
> >>>>>>> where we would like to have at least the option to assign Read1 to both F1 and F2:
> >>>>>>>
> >>>>>>> F1: 1
> >>>>>>> F2: 1
> >>>>>>>
> >>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently
> >>>>>>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap
> >>>>>>> option is frequently a problem when working with poorly annotated genomes (high
> >>>>>>> uncertainty about hidden/incorrect feature overlaps) or various
> >>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read
> >>>>>>> assignments than not counting at all as shown above.
> >>>>>>>
> >>>>>>> ## Here is a code example illustrating the same case:
> >>>>>>> library(GenomicRanges); library(Rsamtools)
> >>>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)),
> >>>>>>>           pos = as.integer(c(500)),
> >>>>>>>           cigar = rep("100M", 1), strand = strand(rep("+", 1)))
> >>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1")
> >>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2")
> >>>>>>> gr <- c(gr1, gr2)
> >>>>>>>
> >>>>>>> ## All of the three current modes in summarizeOverlaps return a count of zero
> >>>>>>> ## because of its inter_feature_overlap awareness:
> >>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts
> >>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts
> >>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
> >>>>>>>          [,1]
> >>>>>>> [1,]    0
> >>>>>>> [2,]    0
> >>>>>>>
> >>>>>>> ## However, countOverlaps handles this correctly, but is not the best choice when
> >>>>>>> ## counting multiple range features.
> >>>>>>> countOverlaps(gr, rd)
> >>>>>>> [1] 1 1
> >>>>>>>
> >>>>>>>
> >>>>>>> Thomas
> >>>>>>>
> >>>>>>>> sessionInfo()
> >>>>>>> R version 3.0.0 (2013-04-03)
> >>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
> >>>>>>>
> >>>>>>> locale:
> >>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> >>>>>>>
> >>>>>>> attached base packages:
> >>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> >>>>>>> [8] base
> >>>>>>>
> >>>>>>> other attached packages:
> >>>>>>> [1] Rsamtools_1.12.0     Biostrings_2.28.0    GenomicRanges_1.12.1
> >>>>>>> [4] IRanges_1.18.0       BiocGenerics_0.6.0
> >>>>>>>
> >>>>>>> loaded via a namespace (and not attached):
> >>>>>>> [1] bitops_1.0-5   stats4_3.0.0   tools_3.0.0    zlibbioc_1.6.0
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote:
> >>>>>>>> Hi Thomas,
> >>>>>>>>
> >>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote:
> >>>>>>>>> Dear Valerie,
> >>>>>>>>>
> >>>>>>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap
> >>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently,
> >>>>>>>>> one can switch back to countOverlaps when feature overlap unawareness is
> >>>>>>>>> the more appropriate counting mode for a biological question, but then
> >>>>>>>>> double counting of reads mapping to multiple-range features is not
> >>>>>>>>> accounted for. It would be really nice to have such a feature-overlap
> >>>>>>>>> unaware option directly in summarizeOverlaps.
> >>>>>>>>
> >>>>>>>> No, we don't currently have an option to ignore feature-overlap. It
> >>>>>>>> sounds like you want to count with countOverlaps() but still want the
> >>>>>>>> double counting to be resolved. This is essentially what the other modes
> >>>>>>>> are doing so I must be missing something.
> >>>>>>>>
> >>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. With
> >>>>>>>> something like ignorefeature0L=TRUE, what results would you expect to
> >>>>>>>> see? Maybe you have another, more descriptive example?
> >>>>>>>>
> >>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
> >>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
> >>>>>>>>                          names=c("A", "B")))
> >>>>>>>>
> >>>>>>>>      > countOverlaps(features, reads)
> >>>>>>>> [1] 2 1
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Another question relates to the memory usage of summarizeOverlaps. Has
> >>>>>>>>> this been optimized yet? On a typical bam file with ~50-100 million
> >>>>>>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To
> >>>>>>>>> use the function on a desktop computer or in large-scale RNA-Seq
> >>>>>>>>> projects on a commodity compute cluster, it would be desirable if every
> >>>>>>>>> counting instance would consume not more than 5GB of RAM.
> >>>>>>>>
> >>>>>>>> Have you tried the BamFileList-method? There is an example at the bottom
> >>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan
> >>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when creating the
> >>>>>>>> BamFile. This method also makes use of mclapply().
> >>>>>>>>
> >>>>>>>> Valerie
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks in advance for your help and suggestions,
> >>>>>>>>>
> >>>>>>>>> Thomas
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Bioconductor mailing list
> >>>>>>>>> Bioconductor at r-project.org
> >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>>>>>>>
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> 
> 
> -- 
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793