[BioC] summarizeOverlaps mode ignoring inter feature overlaps

Wed Apr 10 05:01:49 CEST 2013

Thanks for the tip. I guess doing it this way reverses the counting mode
back to countOverlaps, but how can I use at the same time
"IntersectionStrict" or any of the other modes provided by
summarizeOverlaps if its mode argument is already used and countOverlaps
doesn't accept one?

Thomas

On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote:
> Thanks for the example. You're right, none of the modes will count a 
> read falling completely within >1 feature.
> 
> You can supply your own function as a 'mode'. Two requirements must be met:
> 
> (1) The function must have 3 arguments that correspond to
>      'features', 'reads' and 'ignore.strand', in that order.
> 
> (2) The function must return a vector of counts the same length
>      as 'features'
> 
> Here is an example using countOverlaps() which I think gets at the 
> counting you want.
> 
> counter <- function(x, y,  ignore.strand)
>      countOverlaps(y, x, ignore.strand=ignore.strand)
> 
>  > assays(summarizeOverlaps(gr, rd, mode=counter))$counts
>       [,1]
> [1,]    1
> [2,]    1
> 
> 
> Valerie
> 
> On 04/09/2013 09:37 AM, Thomas Girke wrote:
> > Hi Valerie,
> >
> > Perhaps let's call this more explicitly an ignore_inter_feature_overlap option
> > that we often need for cases like this:
> >
> > Read1    ----
> > F1 ----------------
> > F2   -----------
> >
> > where we would like to have at least the option to assign Read1 to both F1 and F2:
> >
> > F1: 1
> > F2: 1
> >
> > However, summarizeOverlapse doesn't count Read1 at all in all of its currently
> > available modes that I am aware of. This lack of an ignore_inter_feature_overlap
> > option is frequently a problem when working with poorly annotated genomes (high
> > uncertainty about hidden/incorrect feature overlaps) or various
> > RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read
> > assignments than not counting at all as shown above.
> >
> > ## Here is a code example illustrating the same case:
> > library(GenomicRanges); library(Rsamtools)
> > rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)),
> >        pos = as.integer(c(500)),
> >        cigar = rep("100M", 1), strand = strand(rep("+", 1)))
> > gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1")
> > gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2")
> > gr <- c(gr1, gr2)
> >
> > ## All of the three current modes in summarizeOverlaps return a count of zero
> > ## because of its inter_feature_overlap awareness:
> > assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts
> > assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts
> > assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
> >       [,1]
> > [1,]    0
> > [2,]    0
> >
> > ## However, countOverlaps handles this correctly, but is not the best choice when
> > ## counting multiple range features.
> > countOverlaps(gr, rd)
> > [1] 1 1
> >
> >
> > Thomas
> >
> >> sessionInfo()
> > R version 3.0.0 (2013-04-03)
> > Platform: x86_64-apple-darwin10.8.0 (64-bit)
> >
> > locale:
> > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> >
> > attached base packages:
> > [1] parallel  stats     graphics  grDevices utils     datasets  methods
> > [8] base
> >
> > other attached packages:
> > [1] Rsamtools_1.12.0     Biostrings_2.28.0    GenomicRanges_1.12.1
> > [4] IRanges_1.18.0       BiocGenerics_0.6.0
> >
> > loaded via a namespace (and not attached):
> > [1] bitops_1.0-5   stats4_3.0.0   tools_3.0.0    zlibbioc_1.6.0
> >
> >
> > On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote:
> >> Hi Thomas,
> >>
> >> On 04/08/2013 05:52 PM, Thomas Girke wrote:
> >>> Dear Valerie,
> >>>
> >>> Is there currently any way to run summarizeOverlaps in a feature-overlap
> >>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently,
> >>> one can switch back to countOverlaps when feature overlap unawareness is
> >>> the more appropriate counting mode for a biological question, but then
> >>> double counting of reads mapping to multiple-range features is not
> >>> accounted for. It would be really nice to have such a feature-overlap
> >>> unaware option directly in summarizeOverlaps.
> >>
> >> No, we don't currently have an option to ignore feature-overlap. It
> >> sounds like you want to count with countOverlaps() but still want the
> >> double counting to be resolved. This is essentially what the other modes
> >> are doing so I must be missing something.
> >>
> >> In this example 2 reads hit feature A, 1 read hits feature B. With
> >> something like ignorefeature0L=TRUE, what results would you expect to
> >> see? Maybe you have another, more descriptive example?
> >>
> >> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
> >> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
> >>                       names=c("A", "B")))
> >>
> >>   > countOverlaps(features, reads)
> >> [1] 2 1
> >>
> >>
> >>>
> >>> Another question relates to the memory usage of summarizeOverlaps. Has
> >>> this been optimized yet? On a typical bam file with ~50-100 million
> >>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To
> >>> use the function on a desktop computer or in large-scale RNA-Seq
> >>> projects on a commodity compute cluster, it would be desirable if every
> >>> counting instance would consume not more than 5GB of RAM.
> >>
> >> Have you tried the BamFileList-method? There is an example at the bottom
> >> of the ?BamFileList man page using summarizeOverlaps(). As Ryan
> >> mentioned, the key is to set the 'yieldSize' parameter when creating the
> >> BamFile. This method also makes use of mclapply().
> >>
> >> Valerie
> >>
> >>>
> >>> Thanks in advance for your help and suggestions,
> >>>
> >>> Thomas
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>