[BioC] summarizeOverlaps mode ignoring inter feature overlaps

Thomas Girke thomas.girke at ucr.edu
Wed Apr 10 22:43:08 CEST 2013


Yes, when trying to switch from countOverlaps to summarizeOverlaps I was
hoping the latter would provide all the functionalities of the former
plus some of the new features. Achieving the desirable result with a
custom function would be fine for me or other more advanced R users, but
I don't think it is a substitute for predefined and ~FDA approved~
counting modes where we want to have assurance that scientist A can
communicate to scientist B what counter s/he used and both get the same
result while not requiring any expert knowledge in R. I feel this is
particularly important for read counting since it is so fundamental to 
many application areas in NGS sequence analysis.

Thomas


On Wed, Apr 10, 2013 at 04:59:53PM +0000, Valerie Obenchain wrote:
> I think he wants both the counts from using a 'mode' and counts for the 
> intra-feature reads. Using countOverlaps(..., type="within") would get 
> the counts for the intra-feature reads only and then you'd need to add 
> them to the counts from using 'mode'.
> 
> The interface for using findOverlaps/countOverlaps from 
> summarizeOverlaps() alreay exists but may not be user-friendly or 
> intuitive enough. For example, if you wanted to count with 
> countOverlaps(..., type="within"), where you are interested in the reads 
> falling "within" the features, the counter would be something like,
> 
> counter <- function(x, y,  ignore.strand)
> {
>      counts <- subjectHits(findOverlaps(y, x, type="within",
>                                         ignore.strand=ignore.strand))
>      structure(tabulate(counts, NROW(x)), names=names(x))
> }
> 
> 
> Too cryptic?
> 
> Valerie
> 
> 
> On 04/10/2013 09:16 AM, Michael Lawrence wrote:
> > It sounds like what Thomas wants should be achievable with
> > countOverlaps. For example, IntersectionStrict (if treating all features
> > independently) would equate to type="within". So we just need a
> > summarizeOverlaps-style interface to those simple utilities. I agree
> > this would be very useful.
> >
> > Michael
> >
> >
> > On Wed, Apr 10, 2013 at 8:50 AM, Valerie Obenchain <vobencha at fhcrc.org
> > <mailto:vobencha at fhcrc.org>> wrote:
> >
> >     Ah, I see. You'd like to count with one of the existing modes but
> >     have the option to pick up counts for these inter-feature reads
> >     (fall completely 'within' >1 feature). These inter-feature reads
> >     would be double (triple, quadruple, etc.) counted. Essentially they
> >     would add one count to each feature they hit. Right?
> >
> >     One more thought about memory usage. If you are working with
> >     single-end reads the summarizeOverlaps,BamFileList-__method I
> >     mentioned below should work fine. The approach is slightly different
> >     with paired-end reads. Our current algorithm for pairing paired-end
> >     reads requires the whole file to be read into memory. A different
> >     approach is currently being developed but in the meantime you can
> >     take the qname-sorted approach. The Bam file will need to be sorted
> >     by qname and both the 'yieldSize' and 'obeyQname=TRUE' set when
> >     creating the BamFile/BamFileList. An example is on ?BamFile.
> >
> >     Valerie
> >
> >
> >
> >     On 04/09/2013 08:01 PM, Thomas Girke wrote:
> >
> >         Thanks for the tip. I guess doing it this way reverses the
> >         counting mode
> >         back to countOverlaps, but how can I use at the same time
> >         "IntersectionStrict" or any of the other modes provided by
> >         summarizeOverlaps if its mode argument is already used and
> >         countOverlaps
> >         doesn't accept one?
> >
> >         Thomas
> >
> >         On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote:
> >
> >             Thanks for the example. You're right, none of the modes will
> >             count a
> >             read falling completely within >1 feature.
> >
> >             You can supply your own function as a 'mode'. Two
> >             requirements must be met:
> >
> >             (1) The function must have 3 arguments that correspond to
> >                    'features', 'reads' and 'ignore.strand', in that order.
> >
> >             (2) The function must return a vector of counts the same length
> >                    as 'features'
> >
> >             Here is an example using countOverlaps() which I think gets
> >             at the
> >             counting you want.
> >
> >             counter <- function(x, y,  ignore.strand)
> >                    countOverlaps(y, x, ignore.strand=ignore.strand)
> >
> >                > assays(summarizeOverlaps(gr, rd, mode=counter))$counts
> >                     [,1]
> >             [1,]    1
> >             [2,]    1
> >
> >
> >             Valerie
> >
> >             On 04/09/2013 09:37 AM, Thomas Girke wrote:
> >
> >                 Hi Valerie,
> >
> >                 Perhaps let's call this more explicitly an
> >                 ignore_inter_feature_overlap option
> >                 that we often need for cases like this:
> >
> >                 Read1    ----
> >                 F1 ----------------
> >                 F2   -----------
> >
> >                 where we would like to have at least the option to
> >                 assign Read1 to both F1 and F2:
> >
> >                 F1: 1
> >                 F2: 1
> >
> >                 However, summarizeOverlapse doesn't count Read1 at all
> >                 in all of its currently
> >                 available modes that I am aware of. This lack of an
> >                 ignore_inter_feature_overlap
> >                 option is frequently a problem when working with poorly
> >                 annotated genomes (high
> >                 uncertainty about hidden/incorrect feature overlaps) or
> >                 various
> >                 RNA/ChIP-Seq _engineering_ projects where I rather take
> >                 the risk of ambiguous read
> >                 assignments than not counting at all as shown above.
> >
> >                 ## Here is a code example illustrating the same case:
> >                 library(GenomicRanges); library(Rsamtools)
> >                 rd <- GappedAlignments(letters[1], seqnames =
> >                 Rle(rep("chr1",1)),
> >                          pos = as.integer(c(500)),
> >                          cigar = rep("100M", 1), strand =
> >                 strand(rep("+", 1)))
> >                 gr1 <- GRanges("chr1", IRanges(start=100, width=1001),
> >                 strand="+", ID="feat1")
> >                 gr2 <- GRanges("chr1", IRanges(start=500, width=101),
> >                 strand="+", ID="feat2")
> >                 gr <- c(gr1, gr2)
> >
> >                 ## All of the three current modes in summarizeOverlaps
> >                 return a count of zero
> >                 ## because of its inter_feature_overlap awareness:
> >                 assays(summarizeOverlaps(gr, rd, mode="Union",
> >                 ignore.strand=TRUE))$counts
> >                 assays(summarizeOverlaps(gr, rd,
> >                 mode="IntersectionStrict", ignore.strand=TRUE))$counts
> >                 assays(summarizeOverlaps(gr, rd,
> >                 mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
> >                         [,1]
> >                 [1,]    0
> >                 [2,]    0
> >
> >                 ## However, countOverlaps handles this correctly, but is
> >                 not the best choice when
> >                 ## counting multiple range features.
> >                 countOverlaps(gr, rd)
> >                 [1] 1 1
> >
> >
> >                 Thomas
> >
> >                     sessionInfo()
> >
> >                 R version 3.0.0 (2013-04-03)
> >                 Platform: x86_64-apple-darwin10.8.0 (64-bit)
> >
> >                 locale:
> >                 [1]
> >                 en_US.UTF-8/en_US.UTF-8/en_US.__UTF-8/C/en_US.UTF-8/en_US.UTF-__8
> >
> >                 attached base packages:
> >                 [1] parallel  stats     graphics  grDevices utils
> >                 datasets  methods
> >                 [8] base
> >
> >                 other attached packages:
> >                 [1] Rsamtools_1.12.0     Biostrings_2.28.0
> >                   GenomicRanges_1.12.1
> >                 [4] IRanges_1.18.0       BiocGenerics_0.6.0
> >
> >                 loaded via a namespace (and not attached):
> >                 [1] bitops_1.0-5   stats4_3.0.0   tools_3.0.0
> >                   zlibbioc_1.6.0
> >
> >
> >                 On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie
> >                 Obenchain wrote:
> >
> >                     Hi Thomas,
> >
> >                     On 04/08/2013 05:52 PM, Thomas Girke wrote:
> >
> >                         Dear Valerie,
> >
> >                         Is there currently any way to run
> >                         summarizeOverlaps in a feature-overlap
> >                         unaware mode, e.g with an
> >                         ignorefeatureOL=FALSE/TRUE setting? Currently,
> >                         one can switch back to countOverlaps when
> >                         feature overlap unawareness is
> >                         the more appropriate counting mode for a
> >                         biological question, but then
> >                         double counting of reads mapping to
> >                         multiple-range features is not
> >                         accounted for. It would be really nice to have
> >                         such a feature-overlap
> >                         unaware option directly in summarizeOverlaps.
> >
> >
> >                     No, we don't currently have an option to ignore
> >                     feature-overlap. It
> >                     sounds like you want to count with countOverlaps()
> >                     but still want the
> >                     double counting to be resolved. This is essentially
> >                     what the other modes
> >                     are doing so I must be missing something.
> >
> >                     In this example 2 reads hit feature A, 1 read hits
> >                     feature B. With
> >                     something like ignorefeature0L=TRUE, what results
> >                     would you expect to
> >                     see? Maybe you have another, more descriptive example?
> >
> >                     reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
> >                     features <- GRanges("chr1", IRanges(c(1, 20), width=10,
> >                                             names=c("A", "B")))
> >
> >                         > countOverlaps(features, reads)
> >                     [1] 2 1
> >
> >
> >
> >                         Another question relates to the memory usage of
> >                         summarizeOverlaps. Has
> >                         this been optimized yet? On a typical bam file
> >                         with ~50-100 million
> >                         reads the memory usage of summarizeOverlaps is
> >                         often around 10-20GB. To
> >                         use the function on a desktop computer or in
> >                         large-scale RNA-Seq
> >                         projects on a commodity compute cluster, it
> >                         would be desirable if every
> >                         counting instance would consume not more than
> >                         5GB of RAM.
> >
> >
> >                     Have you tried the BamFileList-method? There is an
> >                     example at the bottom
> >                     of the ?BamFileList man page using
> >                     summarizeOverlaps(). As Ryan
> >                     mentioned, the key is to set the 'yieldSize'
> >                     parameter when creating the
> >                     BamFile. This method also makes use of mclapply().
> >
> >                     Valerie
> >
> >
> >                         Thanks in advance for your help and suggestions,
> >
> >                         Thomas
> >
> >                         _________________________________________________
> >                         Bioconductor mailing list
> >                         Bioconductor at r-project.org
> >                         <mailto:Bioconductor at r-project.org>
> >                         https://stat.ethz.ch/mailman/__listinfo/bioconductor
> >                         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> >                         Search the archives:
> >                         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> >                         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
> >
> >
> >     _________________________________________________
> >     Bioconductor mailing list
> >     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> >     https://stat.ethz.ch/mailman/__listinfo/bioconductor
> >     <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> >     Search the archives:
> >     http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http://news.gmane.org/gmane.science.biology.informatics.conductor>
> >
> >



More information about the Bioconductor mailing list