[BioC] Using GenomicAlignments to get Exon-centric Counts

Valerie Obenchain vobencha at fhcrc.org
Tue Jul 1 03:55:13 CEST 2014


Hi,

On 06/30/2014 04:51 PM, Fong Chun Chan wrote:
> Hi Valerie,
>
> Thanks for your reply. After re-reading that paired-end statement, I
> understand what you meant now and was definitely confused because I
> thought it was being treated as two single-end reads which would have
> defeated the purpose of using reading it in as a GAlignmentPairs object.
>
> Thanks for the clarification on the inter.feature = FALSE parameter.
> That is exactly what I want actually. I wanted split reads across two
> exons should be counted as contributing to each exon. Even more
> importantly are paired-reads where the other pair is mapped to another
> distinct exon needs to be included also as contributing one to each exon.

Great! I'm glad it fits that niche.

>
> The default counting mode settings are likely more applicable to when
> working on gene-centric models where you have GRangeList objects with
> each exon as a GRange object (hopefully I am correct in my understanding
> of that) and in those situations you probably don't want to double count.

Yes, the modes w/ default settings were intended for gene-based models, 
pattered after Simon Anders' HTSeq. 'inter.feature' was added later by 
request and has extended the functionality to situations like you describe.

Valerie
>
>
>
>
> On Mon, Jun 30, 2014 at 4:29 PM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
>     Hi Fong,
>
>     readGAlignments() is for single-end data, readGAlignmentPairs() and
>     readGAlignmentsList() are for paired-end. If you read paired-end
>     data with readGAlignments() and perform counting, a read that
>     overlaps both pair parts will be counted twice (once for each part).
>     This is probably not what you want.
>
>     More below.
>
>
>     On 06/30/2014 03:19 PM, Fong Chun Chan wrote:
>
>         Hi,
>
>         Was wondering if anyone had any experience working with the
>         GenomicAlignments R package when trying to retrieve per-exon
>         count data?
>         Specifically, whether these is rationale for using
>         readGAlignmentsPairs()
>         over readGAlignments() when using the summarizeOverlaps()
>         function? From my
>         understanding, the readGAlignmentsPairs() function will traverse
>         the input
>         bam file and return each read pair as one GAlignment.
>
>
>     Data read with readGAlignmentPairs() is put in a GAlignmentPairs
>     class. Counting operations on a GAlignmentPairs class are aware that
>     each pair should be counted as one record. If a range overlaps one
>     or both portions of the pair it will be counted as one hit.
>
>     The same paired-end data read in with readGAlignments() is put in a
>     GAlignments class. Counting operations on this class treat each
>     record/row individually so you will get more counts than with the
>     GAlignmentPairs class (you would get exactly double the counts if
>     each read overlapped both parts of the pair).
>
>
>     Looking at the
>
>         summarizeOverlaps() function document, it states that:
>
>         "paired-end reads : Paired-end reads are counted the same as
>         single-end
>         reads with a junction (N operation in the CIGAR); each pair
>         registers as a
>         single hit. Paired-end records can be counted in a GAlignmentPairs
>         container or BAM file."
>
>
>     This statement is confusing and should be removed. Thanks for
>     pointing it out. I'll update the docs.
>
>     summarizeOverlaps() uses different read functions depending on what
>     arguments are used to call it. Here is a summary.
>
>     - readGAlignments() when 'sigleEnd=TRUE'
>     - readGAlignmentPairs() when 'singleEnd=FALSE' AND 'fragments=FALSE'
>     - readGAlignmentsList() when 'singleEnd=FALSE' AND 'fragments=TRUE'
>
>     If you have paired-end data you should be setting 'singleEnd=FALSE'
>     in summarizeOverlaps(). If you are interested in all reads, not just
>     those that are paired according to the algorithem, use
>     'fragments=TRUE'. The ?readGAlignmentsList man page describes the
>     pairing algorithm and how to use the 'mate_status' metadata column
>     to investigate reads of different pairing status.
>
>
>
>         If the paired-end reads are counted as single-end reads, does
>         this not
>         suggest that using the readGAlignments() and
>         readGAlignmentsPairs() returns
>         you somewhat similar results?
>
>
>      From above: paired-end are not counted as single-end.
>
>
>
>         Another question I have is if a single-read is actually split
>         between two
>         exons and the exons features are stored as a GRanges (and NOT a
>         GRangesList) object then the default summarizeOverlaps()
>         function would
>         technically discard this read if I understand correctly? Since
>         the default
>         mode is Union and the read overlaps with both features. In fact,
>         wouldn't
>         all the modes actually discard the read? If so, is it correct to use
>         inter.feature = FALSE, to include split-reads into the equationb?
>
>
>     The default is 'Union' but you can change that to
>     'IntersectionStrict' or 'IntersectionNotEmpty'. It is possible that
>     none of the modes will be able to 'make a decision' about what
>     feature the read should be assigned to. There is a graphic in the
>     vignette that shows what types of overlaps are handled.
>
>     browseVignettes('__GenomicAlignments')
>
>     Using 'inter.feature=FALSE' will allow double counting; the spanning
>     read will be counted once for each feature it overlaps. See the
>     'Counting multi-hit reads' example section at the bottom of the
>     summarizeOverlaps man page.
>
>
>
>     Valerie
>
>
>         Thanks for any advice on this. sessionInfo() below:
>
>             sessionInfo()
>
>         R version 3.1.0 (2014-04-10)
>         Platform: x86_64-unknown-linux-gnu (64-bit)
>
>         locale:
>            [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>         LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>         LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>           LC_PAPER=en_US.UTF-8
>                 LC_NAME=C                  LC_ADDRESS=C
>         LC_TELEPHONE=C
>                       LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
>         attached base packages:
>         [1] parallel  stats     graphics  grDevices utils     datasets
>           methods
>         base
>
>         other attached packages:
>            [1] argparse_1.0.1          proto_0.3-10
>            GenomicAlignments_1.0.1 BSgenome_1.32.0         Rsamtools_1.16.1
>            Biostrings_2.32.0       XVector_0.4.0
>         GenomicFeatures_1.16.2
>            AnnotationDbi_1.26.0    Biobase_2.24.0
>           GenomicRanges_1.16.3
>            GenomeInfoDb_1.0.2      IRanges_1.22.9
>           BiocGenerics_0.10.0
>         vimcom.plus_0.9-93
>         [16] setwidth_1.0-3          colorout_1.0-2
>
>         loaded via a namespace (and not attached):
>            [1] BatchJobs_1.2      BBmisc_1.7         BiocParallel_0.6.1
>         biomaRt_2.20.0     bitops_1.0-6       brew_1.0-6
>         checkmate_1.0
>            codetools_0.2-8    DBI_0.2-7          digest_0.6.4       fail_1.2
>             findpython_1.0.1   foreach_1.4.2      getopt_1.20.0
>           iterators_1.0.7
>              plyr_1.8.1         Rcpp_0.11.2        RCurl_1.95-4.1
>         [19] rjson_0.2.14       RSQLite_0.11.4     rtracklayer_1.24.2
>         sendmailR_1.1-2    stats4_3.1.0       stringr_0.6.2      tools_3.1.0
>            XML_3.98-1.1       zlibbioc_1.10.0
>
>                  [[alternative HTML version deleted]]
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>         Search the archives:
>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>     --
>     Valerie Obenchain
>     Program in Computational Biology
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, Seattle, WA 98109
>
>     Email: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>     Phone: (206) 667-3158 <tel:%28206%29%20667-3158>
>
>



More information about the Bioconductor mailing list