[BioC] summarizeOverlaps using GRanges or bed file as reads?

Ryan rct at thompsonclan.org
Tue Apr 15 19:29:15 CEST 2014


Ok, so using this method, I could write a function that takes the reads, 
coerces them to GRanges, and then manipulates the start and end points 
as I like, and then dispatches to the standard counting functions. Then 
I can call summarizeOverlaps on my bam files directly. That's pretty nice.

On 4/14/14, 11:35 PM, Martin Morgan wrote:
> On 04/14/2014 06:08 PM, Ryan C. Thompson wrote:
>> Hello,
>>
>> I would like to manipulate the start and end positions of my reads 
>> before
>> calling summarizeOverlaps. One way to do this is to convert my reads 
>> to a
>> GRanges and then use flank, narrow, etc. to properly position the 
>> read ends
>> where I want them. However, I don't see a method for 
>> summarizeOverlaps that
>> accepts a GRanges object or bed file or similar for the reads. Is 
>> there such a
>> method, and if not, would it be possible to add it?
>>
>> The specific application I have in mind is single-end ChIP-Seq reads, 
>> where we
>> have a good idea of what the fragment length is and would like to 
>> extend the
>> reads to this length. Alternately, it may be preferable to count only 
>> the
>> 5-prime ends of the read, and this could be done by narrowing to 1 bp 
>> length.
>
>
> if bam is 'bed file or similar' then... summarizeOverlaps can take an 
> arbitrary function as it's 'mode' argument, as documented on 
> ?summarizeOverlaps
>
>      ## The 'mode' argument can take a custom count function whose
>      ## arguments are the same as those in the current counting modes
>      ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict).
>      ## In this example records are filtered by map quality before 
> counting.
>
>      mapq_filter <- function(features, reads,  ignore.strand, 
> inter.feature) {
>          require(GenomicAlignments) # needed for parallel evaluation
>          Union(features, reads[mcols(reads)$mapq >= 20],
>                ignore.strand=ignore.strand,
>                inter.feature=inter.feature)
>      }
>
>      genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100))
>      param <- ScanBamParam(what="mapq")
>      fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>      se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param)
>      assays(se)$counts
>
>      ## The count function can be completely custom (i.e., not use the
>      ## pre-defined count functions at all). Requirements are that
>      ## the input arguments match the pre-defined modes and the output
>      ## is a vector of counts the same length as 'features'.
>
>      my_count <- function(features, reads,  ignore.strand, 
> inter.feature) {
>          ## perform filtering, or subsetting etc.
>          require(GenomicAlignments) # needed for parallel evaluation
>          countOverlaps(features, reads)
>      }
>
> the 'reads' argument is a GAlignments / GAlignmentsList (for single / 
> paired reads) so you could do whatever you'd like to them, so long as 
> your function returns a vector of counts equal to length(features)
>
>     ryans_count = function(features, reads, ...) {
>         reads = as(reads, "GRanges")   ## equivalently (??) 
> granges(reads)
>         width(reads) = 147
>         countOverlaps(features, reads)
>     }
>
> I think you're also interested in width of overlaps, so you could 
> implement that in ryans_count, too, e.g., via
>
>
> http://stackoverflow.com/questions/14685802/width-of-the-overlapped-segment-in-genomicranges-package 
>
>
> or
>
> https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html
>
> and other parts of that thread, include the late continuation
>
> https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html
>
> (probably need to think through carefully what these are doing).
>
> summarizeOverlaps will iterate through a bam file in chunks, so 
> moderating memory use (your function will be called several times).
>
> If you have several bam files and a linux / Mac, load the BiocParallel 
> package and it'll distribute one bam file to each processor. If memory 
> becomes a problem use BamFileList(your_files, yieldSize=500000) or 
> similar to reduce the size of each chunk (the default yield size is I 
> think 1,000,000).
>
>
> A newer approach is to build a tool of your own using the bed / wig 
> reader functions in rtracklayer with GenomicRanges::tileGenome to 
> create large-ish chunks of file to process, followed by, e.g., 
> GenomicFiles::reduceByFile to MAP from the file to count in each tile, 
> and REDUCE to summarize counts within a file; the vignette outlines 
> this approach for several case studies (though not for counting reads 
> overlapping ranges -- I'm sure a worked example using, e.g.,
>
>
> http://bioconductor.org/packages/release/data/experiment/html/RNAseqData.HNRNPC.bam.chr14.html 
>
>
> would be appreciated).
>
>
> Martin
>
>>
>> -Ryan Thompson
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>



More information about the Bioconductor mailing list