[BioC] summarizeOverlaps using GRanges or bed file as reads?

Ryan rct at thompsonclan.org
Tue Apr 15 19:25:06 CEST 2014


Thanks! This is exactly what I wanted.

On Tue Apr 15 09:24:27 2014, Valerie Obenchain wrote:
> Methods for 'reads' as 'GRanges' and 'GRangesList' have been added to
> GenomicAlignments 1.1.1.
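>
> A minimal sketch of what this enables (coordinates and object names below
> are made up for illustration; assumes GenomicAlignments >= 1.1.1):
>
>      library(GenomicAlignments)
>      ## reads supplied directly as a GRanges -- no bam file needed
>      reads <- GRanges("chr1", IRanges(c(100, 300, 550), width=36),
>                       strand="+")
>      features <- GRanges("chr1", IRanges(c(1, 501), width=500))
>      se <- summarizeOverlaps(features, reads, mode=Union)
>      assays(se)$counts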
>
> Valerie
>
> On 04/14/2014 11:35 PM, Martin Morgan wrote:
>> On 04/14/2014 06:08 PM, Ryan C. Thompson wrote:
>>> Hello,
>>>
>>> I would like to manipulate the start and end positions of my reads
>>> before
>>> calling summarizeOverlaps. One way to do this is to convert my reads
>>> to a
>>> GRanges and then use flank, narrow, etc. to properly position the read
>>> ends
>>> where I want them. However, I don't see a method for summarizeOverlaps
>>> that
>>> accepts a GRanges object or bed file or similar for the reads. Is
>>> there such a
>>> method, and if not, would it be possible to add it?
>>>
>>> The specific application I have in mind is single-end ChIP-Seq reads,
>>> where we
>>> have a good idea of what the fragment length is and would like to
>>> extend the
>>> reads to this length. Alternately, it may be preferable to count
>>> only the
>>> 5-prime ends of the read, and this could be done by narrowing to 1 bp
>>> length.
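>>>
>>> (A rough sketch of the two manipulations I mean, assuming the reads are
>>> already a GRanges; the 200 bp fragment length is illustrative only:)
>>>
>>>      library(GenomicRanges)
>>>      reads <- GRanges("chr1", IRanges(c(100, 900), width=36),
>>>                       strand=c("+", "-"))
>>>      ## extend each read to the estimated fragment length (strand-aware)
>>>      extended <- resize(reads, width=200, fix="start")
>>>      ## or keep only the 5-prime base of each read
>>>      five_prime <- resize(reads, width=1, fix="start")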
>>
>>
>> If a bam file counts as 'bed file or similar', then... summarizeOverlaps
>> can take an arbitrary function as its 'mode' argument, as documented in
>> ?summarizeOverlaps
>>
>>       ## The 'mode' argument can take a custom count function whose
>>       ## arguments are the same as those in the current counting modes
>>       ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict).
>>       ## In this example records are filtered by map quality before
>>       ## counting.
>>
>>       mapq_filter <- function(features, reads, ignore.strand,
>>                               inter.feature) {
>>           require(GenomicAlignments) # needed for parallel evaluation
>>           Union(features, reads[mcols(reads)$mapq >= 20],
>>                 ignore.strand=ignore.strand,
>>                 inter.feature=inter.feature)
>>       }
>>
>>       genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100))
>>       param <- ScanBamParam(what="mapq")
>>       fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>>       se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param)
>>       assays(se)$counts
>>
>>       ## The count function can be completely custom (i.e., not use the
>>       ## pre-defined count functions at all). Requirements are that
>>       ## the input arguments match the pre-defined modes and the output
>>       ## is a vector of counts the same length as 'features'.
>>
>>       my_count <- function(features, reads, ignore.strand,
>>                            inter.feature) {
>>           ## perform filtering, or subsetting etc.
>>           require(GenomicAlignments) # needed for parallel evaluation
>>           countOverlaps(features, reads)
>>       }
>>
>> The 'reads' argument is a GAlignments / GAlignmentsList (for single- /
>> paired-end reads), so you can do whatever you'd like to them, so long as
>> your function returns a vector of counts of length equal to
>> length(features):
>>
>>      ryans_count = function(features, reads, ...) {
>>          require(GenomicAlignments)     ## needed for parallel evaluation
>>          reads = as(reads, "GRanges")   ## equivalently, granges(reads)
>>          ## resize() is strand-aware, so each read is extended to 147 bp
>>          ## from its 5-prime end
>>          reads = resize(reads, 147)
>>          countOverlaps(features, reads)
>>      }
>>
>> I think you're also interested in width of overlaps, so you could
>> implement that in ryans_count, too, e.g., via
>>
>>
>> http://stackoverflow.com/questions/14685802/width-of-the-overlapped-segment-in-genomicranges-package
>>
>>
>>
>> or
>>
>>    https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html
>>
>> and other parts of that thread, including the late continuation
>>
>>    https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html
>>
>> (probably need to think through carefully what these are doing).
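>>
>> The gist of those threads is something like the following sketch (not
>> code from the threads themselves; names follow ryans_count above):
>>
>>      olap_width <- function(features, reads) {
>>          hits <- findOverlaps(features, reads)
>>          ## width of each overlapping segment, one value per hit
>>          width(pintersect(features[queryHits(hits)],
>>                           reads[subjectHits(hits)]))
>>      }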
>>
>> summarizeOverlaps iterates through a bam file in chunks to moderate
>> memory use, so your function will be called several times.
>>
>> If you have several bam files and a Linux / Mac, load the BiocParallel
>> package and it will distribute one bam file to each processor. If memory
>> becomes a problem, use BamFileList(your_files, yieldSize=500000) or
>> similar to reduce the size of each chunk (the default yield size is, I
>> think, 1,000,000).
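>>
>> Something along these lines (file names are placeholders, and 'genes' is
>> the features object from the earlier example):
>>
>>      library(Rsamtools)
>>      library(BiocParallel)
>>      fls <- BamFileList(c("sample1.bam", "sample2.bam"), yieldSize=500000)
>>      register(MulticoreParam(workers=2))   ## one bam file per worker
>>      se <- summarizeOverlaps(genes, fls, mode=ryans_count)
>>      assays(se)$counts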
>>
>>
>> A newer approach is to build a tool of your own using the bed / wig
>> reader functions in rtracklayer, with GenomicRanges::tileGenome to create
>> large-ish chunks of the file to process, followed by, e.g.,
>> GenomicFiles::reduceByFile to MAP from the file to a count in each tile
>> and REDUCE to summarize counts within a file; the GenomicFiles vignette
>> outlines this approach for several case studies (though not for counting
>> reads overlapping ranges -- I'm sure a worked example using, e.g.,
>>
>>
>> http://bioconductor.org/packages/release/data/experiment/html/RNAseqData.HNRNPC.bam.chr14.html
>>
>>
>>
>> would be appreciated).
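>>
>> Very roughly, the shape of that approach is (a sketch only; the MAP body
>> below reads from bam rather than bed / wig, the file paths are
>> placeholders, and the tile width is arbitrary):
>>
>>      library(GenomicFiles)
>>      library(GenomicAlignments)    ## also attaches Rsamtools
>>      bams <- c("sample1.bam", "sample2.bam")
>>      tiles <- tileGenome(seqinfo(BamFile(bams[1])), tilewidth=1e7,
>>                          cut.last.tile.in.chrom=TRUE)
>>      MAP <- function(range, file, ...) {
>>          ## read only the records in this tile, then count them
>>          reads <- readGAlignments(file, param=ScanBamParam(which=range))
>>          length(reads)
>>      }
>>      ## one element per file, each a vector of per-tile counts
>>      counts <- reduceByFile(tiles, bams, MAP, REDUCE=c)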
>>
>>
>> Martin
>>
>>>
>>> -Ryan Thompson
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>


