[BioC] summarizeOverlaps mode ignoring inter feature overlaps

Thu Apr 11 04:55:23 CEST 2013

Dear Wei,

Thanks. I guess this makes sense, but is it possible this way to perform
transcript level counting in addition to exonic gene-level counting? 

Also, this reminds me of one question related to Rsubread for which I
couldn't find an answer in the documentation (haven't had time to read
your NAR paper though). Is it possible to use Rsubread for aligning
directly against transcriptome sequences where we want to set no
restrictions on multiple mappings of equal quality in order to obtain
transcript read counts? Right I am often doing this with Bowtie2 for our
own transcriptome assemblies where we don't have a reference genome and
then use Rsamtools for counting the reads mapping to individual
transcripts. With Rsubread I was not sure if the algorithm can return
any number of alternative hits? If it can then I would love to give it 
a try for our transcriptome assembly projects.

Thomas

On Thu, Apr 11, 2013 at 02:32:03AM +0000, Wei Shi wrote:
> Hi Tom,
> 
> The featureCounts function in Rsubread package goes some way like what you suggested. It is different from countOverlaps in that each read is counted at most once. It is also different from summarizeOverlaps in that it assigns the read to one of the features if it overlaps with >1 features.
> 
> featureCounts assigns reads to exons first and then summarize exon-level read counts to gene-level read counts. Therefore, if a read overlaps with >1 exons within the same gene it will be counted only once for the gene.
> 
> featureCounts includes in-built annotations for human and mouse genomes (NCBI RefSeq annotation), but you can provide your own annotation as well. You can also use it for the general-purpose read assignment. It is not restricted to the read counting for RNA-seq data. 
> 
> Cheers,
> Wei
> 
> On Apr 11, 2013, at 12:09 PM, Thomas Girke wrote:
> 
> > Hi Martin,
> > 
> > Yes, inter_feature=TRUE would maintain the current counting mode(s) that
> > prohibits counting of reads mapping to multiple features. This is a
> > special case of counting that is very useful for counting exonic regions
> > of genes. However, one also wants to be able to turn off this behavior
> > by ignoring inter feature overlaps (just like ignore.strand=T/F).
> > Otherwise we cannot use summarizeOverlaps along with its current modes
> > for important operations like transcript level counting because many
> > transcript variants from the same gene will mask each others reads when
> > inter_feature=TRUE. Providing the option to output the results from both
> > inter_feature=TRUE and inter_feature=FALSE could be a very sensible
> > solution and time saver for users working with new genomes/GFFs, where
> > one cannot trust every nested annotation for various reasons, and
> > inter_feature=TRUE can quickly become a very risky counting strategy
> > since it tends to erase counts. Every biologist will scream in your
> > face if the counter tells them that their favorite gene has zero counts
> > just because of some overlap with some annotation error:). 
> > 
> > For multiple range features stored in GRangesList objects, I would
> > currently favor making "inter_feature=FALSE" ignore the overlaps
> > occurring among different list components, but not necessarily those
> > within a list component (e.g. exon ranges of a gene). This way one
> > can benefit from the current infrastructure by restricting its feature 
> > overlap scope to the range sets stored within individual list components 
> > but ignoring those among different list components. Utilities like reduce 
> > and other range modifier functions could handle situations where one wants 
> > to ignore all feature overlaps within and among list components. However, 
> > I am sure there could be other solutions to this.
> > 
> > Long story short if inter_feature=TRUE/FALSE could be used in
> > combination with modes=Union/IntersectionStrict/IntersectionNonEmpty
> > resulting in six different counting flavors, I would be happy and, I am 
> > sure, many other Bioc users as well.
> > 
> > I hope this makes sense.
> > 
> > Thomas
> > 
> > 
> > 
> > On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote:
> >> On 04/10/2013 12:42 PM, Thomas Girke wrote:
> >>> Thanks. Adding an inter-feature unaware mode will be very helpful and also
> >>> broaden summarizeOverlaps' application spectrum for a lot of use cases.
> >> 
> >> I'm probably being quite dense here, and am mostly an outside observer. What I 
> >> hear you saying is that there are currently three modes -- Union, 
> >> IntersectionStrict, IntersectionNonEmpty. These modes are summarized in the 
> >> seven rows of figure 1 of
> >> 
> >>    vignette("summarizeOverlaps", package="GenomicRanges")
> >> 
> >> Let's say there is a flag inter_feature, and when its value is TRUE then the 
> >> current counting schemes are obtained. These modes differ in the way counting 
> >> works for reads illustrated in rows 5, 6, and 7 of the figure. You'd like a 
> >> count scored where a '1' appears in the table below. With inter_feature=TRUE 
> >> reads are 'never counted more than once' ('Hits per read' is <= 1)
> >> 
> >> |----------------------+-----+-----------+-----------+---------------|
> >> | Mode                 | Row | Feature 1 | Feature 2 | Hits per read |
> >> |----------------------+-----+-----------+-----------+---------------|
> >> | Union                |   5 | 1         | 0         | 1             |
> >> |                      |   6 | 0         | 0         | 0             |
> >> |                      |   7 | 0         | 0         | 0             |
> >> | IntersectionStrict   |   5 | 1         | 0         | 1             |
> >> |                      |   6 | 1         | 0         | 1             |
> >> |                      |   7 | 0         | 0         | 0             |
> >> | IntersectionNotEmpty |   5 | 1         | 0         | 1             |
> >> |                      |   6 | 1         | 0         | 1             |
> >> |                      |   7 | 0         | 0         | 0             |
> >> |----------------------+-----+-----------+-----------+---------------|
> >> 
> >> 
> >> You'd like to add 3 more modes, with inter_feature=FALSE. Reads are sometimes 
> >> 'counted twice'
> >> 
> >> |----------------------+-----+-----------+-----------+---------------|
> >> | Mode                 | Row | Feature 1 | Feature 2 | Hits per read |
> >> |----------------------+-----+-----------+-----------+---------------|
> >> | Union                |   5 |         1 |         0 |             1 |
> >> |                      |   6 |         1 |         1 |             2 |
> >> |                      |   7 |         1 |         1 |             2 |
> >> | IntersectionStrict   |   5 |         1 |         0 |             1 |
> >> |                      |   6 |         1 |         0 |             1 |
> >> |                      |   7 |         1 |         1 |             2 |
> >> | IntersectionNotEmpty |   5 |         1 |         0 |             1 |
> >> |                      |   6 |         1 |         1 |             2 |
> >> |                      |   7 |         1 |         1 |             2 |
> >> |----------------------+-----+-----------+-----------+---------------|
> >> 
> >> 
> >> Martin
> >> 
> >>> 
> >>> Thomas
> >>> 
> >>> 
> >>> On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote:
> >>>> Hi Thomas,
> >>>> 
> >>>> On 04/10/2013 09:23 AM, Thomas Girke wrote:
> >>>>> Valerie,
> >>>>> 
> >>>>> Please see my inserted comments below.
> >>>>> 
> >>>>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote:
> >>>>>> Ah, I see. You'd like to count with one of the existing modes but have
> >>>>>> the option to pick up counts for these inter-feature reads (fall
> >>>>>> completely 'within' >1 feature). These inter-feature reads would be
> >>>>>> double (triple, quadruple, etc.) counted. Essentially they would add one
> >>>>>> count to each feature they hit. Right?
> >>>>> 
> >>>>> Correct. Perhaps let's discuss this with a very common example of
> >>>>> transcript-level counting rather than counting on the unified (virtual)
> >>>>> exonic regions of genes. With the current description provided in the
> >>>>> summarizeOverlaps vignette at
> >>>>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf
> >>>>> I don't see how this can be achieved without falling back to using
> >>>>> countOverlaps without loosing the new counting modes provided by
> >>>>> summarizeOverlaps?
> >>>> 
> >>>> It can't be achieved with the function as is but we could add an
> >>>> argument to handle this (as you suggested from the start). If
> >>>> 'inter-feature=TRUE' then these counts would be added to the counts
> >>>> already obtained from using a particular 'mode'. I will move ahead with
> >>>> implementing this argument.
> >>>> 
> >>>>> 
> >>>>>> 
> >>>>>> One more thought about memory usage. If you are working with single-end
> >>>>>> reads the summarizeOverlaps,BamFileList-method I mentioned below should
> >>>>>> work fine. The approach is slightly different with paired-end reads. Our
> >>>>>> current algorithm for pairing paired-end reads requires the whole file
> >>>>>> to be read into memory. A different approach is currently being
> >>>>>> developed but in the meantime you can take the qname-sorted approach.
> >>>>>> The Bam file will need to be sorted by qname and both the 'yieldSize'
> >>>>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An
> >>>>>> example is on ?BamFile.
> >>>>> 
> >>>>> Thanks for pointing this out. My fault that I didn't read through all
> >>>>> the linked documentation. Perhaps it may not be a bad idea to make the
> >>>>> memory restricted bam read instances the default setting in the future.
> >>>>> This will definitely help biologists using those utilities without
> >>>>> crashing their machines on the first attempt.
> >>>> 
> >>>> Good suggestion, thanks.
> >>>> 
> >>>> Valerie
> >>>> 
> >>>>> 
> >>>>> Thomas
> >>>>> 
> >>>>> 
> >>>>>> 
> >>>>>> Valerie
> >>>>>> 
> >>>>>> 
> >>>>>> On 04/09/2013 08:01 PM, Thomas Girke wrote:
> >>>>>>> Thanks for the tip. I guess doing it this way reverses the counting mode
> >>>>>>> back to countOverlaps, but how can I use at the same time
> >>>>>>> "IntersectionStrict" or any of the other modes provided by
> >>>>>>> summarizeOverlaps if its mode argument is already used and countOverlaps
> >>>>>>> doesn't accept one?
> >>>>>>> 
> >>>>>>> Thomas
> >>>>>>> 
> >>>>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote:
> >>>>>>>> Thanks for the example. You're right, none of the modes will count a
> >>>>>>>> read falling completely within >1 feature.
> >>>>>>>> 
> >>>>>>>> You can supply your own function as a 'mode'. Two requirements must be met:
> >>>>>>>> 
> >>>>>>>> (1) The function must have 3 arguments that correspond to
> >>>>>>>>        'features', 'reads' and 'ignore.strand', in that order.
> >>>>>>>> 
> >>>>>>>> (2) The function must return a vector of counts the same length
> >>>>>>>>        as 'features'
> >>>>>>>> 
> >>>>>>>> Here is an example using countOverlaps() which I think gets at the
> >>>>>>>> counting you want.
> >>>>>>>> 
> >>>>>>>> counter <- function(x, y,  ignore.strand)
> >>>>>>>>        countOverlaps(y, x, ignore.strand=ignore.strand)
> >>>>>>>> 
> >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode=counter))$counts
> >>>>>>>>         [,1]
> >>>>>>>> [1,]    1
> >>>>>>>> [2,]    1
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> Valerie
> >>>>>>>> 
> >>>>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote:
> >>>>>>>>> Hi Valerie,
> >>>>>>>>> 
> >>>>>>>>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option
> >>>>>>>>> that we often need for cases like this:
> >>>>>>>>> 
> >>>>>>>>> Read1    ----
> >>>>>>>>> F1 ----------------
> >>>>>>>>> F2   -----------
> >>>>>>>>> 
> >>>>>>>>> where we would like to have at least the option to assign Read1 to both F1 and F2:
> >>>>>>>>> 
> >>>>>>>>> F1: 1
> >>>>>>>>> F2: 1
> >>>>>>>>> 
> >>>>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently
> >>>>>>>>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap
> >>>>>>>>> option is frequently a problem when working with poorly annotated genomes (high
> >>>>>>>>> uncertainty about hidden/incorrect feature overlaps) or various
> >>>>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read
> >>>>>>>>> assignments than not counting at all as shown above.
> >>>>>>>>> 
> >>>>>>>>> ## Here is a code example illustrating the same case:
> >>>>>>>>> library(GenomicRanges); library(Rsamtools)
> >>>>>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)),
> >>>>>>>>>          pos = as.integer(c(500)),
> >>>>>>>>>          cigar = rep("100M", 1), strand = strand(rep("+", 1)))
> >>>>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1")
> >>>>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2")
> >>>>>>>>> gr <- c(gr1, gr2)
> >>>>>>>>> 
> >>>>>>>>> ## All of the three current modes in summarizeOverlaps return a count of zero
> >>>>>>>>> ## because of its inter_feature_overlap awareness:
> >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts
> >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts
> >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
> >>>>>>>>>         [,1]
> >>>>>>>>> [1,]    0
> >>>>>>>>> [2,]    0
> >>>>>>>>> 
> >>>>>>>>> ## However, countOverlaps handles this correctly, but is not the best choice when
> >>>>>>>>> ## counting multiple range features.
> >>>>>>>>> countOverlaps(gr, rd)
> >>>>>>>>> [1] 1 1
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> Thomas
> >>>>>>>>> 
> >>>>>>>>>> sessionInfo()
> >>>>>>>>> R version 3.0.0 (2013-04-03)
> >>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
> >>>>>>>>> 
> >>>>>>>>> locale:
> >>>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> >>>>>>>>> 
> >>>>>>>>> attached base packages:
> >>>>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> >>>>>>>>> [8] base
> >>>>>>>>> 
> >>>>>>>>> other attached packages:
> >>>>>>>>> [1] Rsamtools_1.12.0     Biostrings_2.28.0    GenomicRanges_1.12.1
> >>>>>>>>> [4] IRanges_1.18.0       BiocGenerics_0.6.0
> >>>>>>>>> 
> >>>>>>>>> loaded via a namespace (and not attached):
> >>>>>>>>> [1] bitops_1.0-5   stats4_3.0.0   tools_3.0.0    zlibbioc_1.6.0
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote:
> >>>>>>>>>> Hi Thomas,
> >>>>>>>>>> 
> >>>>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote:
> >>>>>>>>>>> Dear Valerie,
> >>>>>>>>>>> 
> >>>>>>>>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap
> >>>>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently,
> >>>>>>>>>>> one can switch back to countOverlaps when feature overlap unawareness is
> >>>>>>>>>>> the more appropriate counting mode for a biological question, but then
> >>>>>>>>>>> double counting of reads mapping to multiple-range features is not
> >>>>>>>>>>> accounted for. It would be really nice to have such a feature-overlap
> >>>>>>>>>>> unaware option directly in summarizeOverlaps.
> >>>>>>>>>> 
> >>>>>>>>>> No, we don't currently have an option to ignore feature-overlap. It
> >>>>>>>>>> sounds like you want to count with countOverlaps() but still want the
> >>>>>>>>>> double counting to be resolved. This is essentially what the other modes
> >>>>>>>>>> are doing so I must be missing something.
> >>>>>>>>>> 
> >>>>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. With
> >>>>>>>>>> something like ignorefeature0L=TRUE, what results would you expect to
> >>>>>>>>>> see? Maybe you have another, more descriptive example?
> >>>>>>>>>> 
> >>>>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
> >>>>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
> >>>>>>>>>>                         names=c("A", "B")))
> >>>>>>>>>> 
> >>>>>>>>>>> countOverlaps(features, reads)
> >>>>>>>>>> [1] 2 1
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> Another question relates to the memory usage of summarizeOverlaps. Has
> >>>>>>>>>>> this been optimized yet? On a typical bam file with ~50-100 million
> >>>>>>>>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To
> >>>>>>>>>>> use the function on a desktop computer or in large-scale RNA-Seq
> >>>>>>>>>>> projects on a commodity compute cluster, it would be desirable if every
> >>>>>>>>>>> counting instance would consume not more than 5GB of RAM.
> >>>>>>>>>> 
> >>>>>>>>>> Have you tried the BamFileList-method? There is an example at the bottom
> >>>>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan
> >>>>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when creating the
> >>>>>>>>>> BamFile. This method also makes use of mclapply().
> >>>>>>>>>> 
> >>>>>>>>>> Valerie
> >>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> Thanks in advance for your help and suggestions,
> >>>>>>>>>>> 
> >>>>>>>>>>> Thomas
> >>>>>>>>>>> 
> >>>>>>>>>>> _______________________________________________
> >>>>>>>>>>> Bioconductor mailing list
> >>>>>>>>>>> Bioconductor at r-project.org
> >>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>>>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>>>>>>>>> 
> >>> 
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>> 
> >> 
> >> 
> >> -- 
> >> Computational Biology / Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N.
> >> PO Box 19024 Seattle, WA 98109
> >> 
> >> Location: Arnold Building M1 B861
> >> Phone: (206) 667-2793
> > 
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:4}}