[BioC] Using GenomicAlignments to get Exon-centric Counts

Tue Jul 1 01:29:49 CEST 2014

Hi Fong,

readGAlignments() is for single-end data, readGAlignmentPairs() and 
readGAlignmentsList() are for paired-end. If you read paired-end data 
with readGAlignments() and perform counting, a read that overlaps both 
pair parts will be counted twice (once for each part). This is probably 
not what you want.

More below.

On 06/30/2014 03:19 PM, Fong Chun Chan wrote:
> Hi,
>
> Was wondering if anyone had any experience working with the
> GenomicAlignments R package when trying to retrieve per-exon count data?
> Specifically, whether these is rationale for using readGAlignmentsPairs()
> over readGAlignments() when using the summarizeOverlaps() function? From my
> understanding, the readGAlignmentsPairs() function will traverse the input
> bam file and return each read pair as one GAlignment.

Data read with readGAlignmentPairs() is put in a GAlignmentPairs class. 
Counting operations on a GAlignmentPairs class are aware that each pair 
should be counted as one record. If a range overlaps one or both 
portions of the pair it will be counted as one hit.

The same paired-end data read in with readGAlignments() is put in a 
GAlignments class. Counting operations on this class treat each 
record/row individually so you will get more counts than with the 
GAlignmentPairs class (you would get exactly double the counts if each 
read overlapped both parts of the pair).

Looking at the
> summarizeOverlaps() function document, it states that:
>
> "paired-end reads : Paired-end reads are counted the same as single-end
> reads with a junction (N operation in the CIGAR); each pair registers as a
> single hit. Paired-end records can be counted in a GAlignmentPairs
> container or BAM file."

This statement is confusing and should be removed. Thanks for pointing 
it out. I'll update the docs.

summarizeOverlaps() uses different read functions depending on what 
arguments are used to call it. Here is a summary.

- readGAlignments() when 'sigleEnd=TRUE'
- readGAlignmentPairs() when 'singleEnd=FALSE' AND 'fragments=FALSE'
- readGAlignmentsList() when 'singleEnd=FALSE' AND 'fragments=TRUE'

If you have paired-end data you should be setting 'singleEnd=FALSE' in 
summarizeOverlaps(). If you are interested in all reads, not just those 
that are paired according to the algorithem, use 'fragments=TRUE'. The 
?readGAlignmentsList man page describes the pairing algorithm and how to 
use the 'mate_status' metadata column to investigate reads of different 
pairing status.

>
> If the paired-end reads are counted as single-end reads, does this not
> suggest that using the readGAlignments() and readGAlignmentsPairs() returns
> you somewhat similar results?

 From above: paired-end are not counted as single-end.

>
> Another question I have is if a single-read is actually split between two
> exons and the exons features are stored as a GRanges (and NOT a
> GRangesList) object then the default summarizeOverlaps() function would
> technically discard this read if I understand correctly? Since the default
> mode is Union and the read overlaps with both features. In fact, wouldn't
> all the modes actually discard the read? If so, is it correct to use
> inter.feature = FALSE, to include split-reads into the equationb?

The default is 'Union' but you can change that to 'IntersectionStrict' 
or 'IntersectionNotEmpty'. It is possible that none of the modes will be 
able to 'make a decision' about what feature the read should be assigned 
to. There is a graphic in the vignette that shows what types of overlaps 
are handled.

browseVignettes('GenomicAlignments')

Using 'inter.feature=FALSE' will allow double counting; the spanning 
read will be counted once for each feature it overlaps. See the 
'Counting multi-hit reads' example section at the bottom of the 
summarizeOverlaps man page.

Valerie

>
> Thanks for any advice on this. sessionInfo() below:
>
>> sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8
>        LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
>              LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> base
>
> other attached packages:
>   [1] argparse_1.0.1          proto_0.3-10
>   GenomicAlignments_1.0.1 BSgenome_1.32.0         Rsamtools_1.16.1
>   Biostrings_2.32.0       XVector_0.4.0           GenomicFeatures_1.16.2
>   AnnotationDbi_1.26.0    Biobase_2.24.0          GenomicRanges_1.16.3
>   GenomeInfoDb_1.0.2      IRanges_1.22.9          BiocGenerics_0.10.0
> vimcom.plus_0.9-93
> [16] setwidth_1.0-3          colorout_1.0-2
>
> loaded via a namespace (and not attached):
>   [1] BatchJobs_1.2      BBmisc_1.7         BiocParallel_0.6.1
> biomaRt_2.20.0     bitops_1.0-6       brew_1.0-6         checkmate_1.0
>   codetools_0.2-8    DBI_0.2-7          digest_0.6.4       fail_1.2
>    findpython_1.0.1   foreach_1.4.2      getopt_1.20.0      iterators_1.0.7
>     plyr_1.8.1         Rcpp_0.11.2        RCurl_1.95-4.1
> [19] rjson_0.2.14       RSQLite_0.11.4     rtracklayer_1.24.2
> sendmailR_1.1-2    stats4_3.1.0       stringr_0.6.2      tools_3.1.0
>   XML_3.98-1.1       zlibbioc_1.10.0
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fhcrc.org
Phone: (206) 667-3158