[BioC] Amplicon and exon level read counts and GC content

Martin Morgan mtmorgan at fhcrc.org
Fri Jun 8 15:03:04 CEST 2012


On 06/07/2012 11:34 PM, Yu Chuan Tai wrote:
> Hi Martin,
>
> Is it correct that the below code calculate the number of reads overlapping
> with an amplicon, and overlapping means at least 1 base overlap, and it
> doesn't have to be fully within the amplicon? In the case that a read
> overlaps
> with 2 amplicons, will it be counted twice? When I used this approach to
> calculate
> amplicon-level read counts, I found the number of read counts
> overlapping with
> all the amplicons is larger than the total number of read counts in the
> BAM file,
> and wonder if that's b/c a read could be counted more than once?

yes

> I found that the code below gives more read counts than using
> summarizeOverlaps().
> I think it's b/c the latter counts a read at most once.

see ?summarizeOverlaps for how counting occurs with different modes.

>
> If I want to calculate the coverage of SNVs/INDELs outputted from samtools,
> is it correct that using summarizeOverlaps() will under-estimate the
> coverage, since
> a read may overlap with several SNVs/INDELs?

It is 'easy' using findOverlaps() or countOverlaps() to create counting 
schemes that are different from those implemented in samtools / scanBam, 
summarizeOverlaps, etc.

Martin

>
> Thanks!
> Yu Chuan
>
>  > param = ScanBamParam(what="seq", which=GRanges("seq1",IRanges(100,
> 500)))
>  > dna = scanBam(fl, param=param)[[1]][["seq"]]
>  > length(dna) # 365 reads overlap region
>  > alphabetFrequency(dna, collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC
>
>
>
> On Thu, 7 Jun 2012, Martin Morgan wrote:
>
>> On 06/07/2012 07:54 AM, Yu Chuan Tai wrote:
>>> Hi Martin,
>>>
>>> Thanks! I will look into the links below. By 'better support for
>>> paired-end reads in the 'devel' version', which package are you
>>> referring to?
>>
>> Mostly GenomicRanges, e.g., readGappedAlignmentPairs, building on
>> additional facilities in Rsamtools. Herve is responsible for this.
>>
>> Martin
>>
>>>
>>> Best,
>>> Yu Chuan
>>>
>>> On Thu, 7 Jun 2012, Martin Morgan wrote:
>>>
>>>> On 06/06/2012 09:53 PM, Yu Chuan Tai wrote:
>>>>> Hi Martin,
>>>>>
>>>>> More questions on your approaches below. If my BAM files are
>>>>> generated by Bowtie2 on pair-end fastq files, for scanBamFlag(),
>>>>> should I set isPaired=TRUE? Do I need to worry about other input
>>>>> arguments for scanBamFlag() or ScanBamParam(), if I want to
>>>>> calculate coverage properly?
>>>>
>>>> It really depends on what you're interested in doing; see for instance
>>>> the post by Herve the other day
>>>>
>>>> https://stat.ethz.ch/pipermail/bioconductor/2012-June/046052.html
>>>>
>>>>>
>>>>> Also, summarizeOverlaps() doesn't seem to handle paired-end reads.
>>>>> How to get around this, or it won't affect coverage calculation?
>>>>
>>>> There is better support for paired-end reads in the 'devel' version of
>>>> Biocondcutor; see
>>>>
>>>> http://bioconductor.org/developers/useDevel/
>>>>
>>>> whether and what aspects of paired-endedness are important depends on
>>>> how you are using your coverage.
>>>>
>>>>>
>>>>> Finally, is there any way to calculate base-specific coverage at any
>>>>> genomic locus or interval in Rsamtools? Thanks!
>>>>
>>>> I tried to answer this in your other post.
>>>>
>>>> Martin
>>>>
>>>>>
>>>>> Best, Yu Chuan
>>>>>
>>>>>> More specifically, after
>>>>>>
>>>>>> library(Rsamtools) example(scanBam) # defines 'fl', a path to a
>>>>>> bam file
>>>>>>
>>>>>> for a _single_ genomic range
>>>>>>
>>>>>> param = ScanBamParam(what="seq", which=GRanges("seq1",
>>>>>> IRanges(100, 500))) dna = scanBam(fl, param=param)[[1]][["seq"]]
>>>>>> length(dna) # 365 reads overlap region alphabetFrequency(dna,
>>>>>> collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC
>>>>>>
>>>>>> though you'd likely want to specify several regions (vector
>>>>>> arguments to GRanges) and think about flags (scanBamFlag() and the
>>>>>> flag argument to ScanBamParam), read mapping quality, reads
>>>>>> overlapping more than one region, etc. (summarizeOverlaps
>>>>>> implements several counting strategies, but it is 'easy' to
>>>>>> implement arbitrary approaches).
>>>>>>
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for any input!
>>>>>>>>
>>>>>>>> Best, Yu Chuan
>>>>>>>>
>>>>>>>> _______________________________________________ Bioconductor
>>>>>>>> mailing list Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
>>>>>>>> archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>>
>>>> --
>>>>>> Computational Biology Fred Hutchinson Cancer Research Center 1100
>>>>>> Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>> Location: M1-B861 Telephone: 206 667-2793
>>>>>>
>>>>
>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list