[BioC] counting RNA-seq reads in R/BioC

Tue Sep 11 18:00:56 CEST 2012

Hi Mark,

Great summary. Thanks for putting it together.

You are correct that summarizeOverlaps() only counts paired-end reads 
with mapped mates. This is because Bam files are read with 
readGappedAlignmentPairs() which retains only reads that are paired and 
have mapped mates,

      flag0 <- scanBamFlag(isPaired=TRUE, hasUnmappedMate=FALSE)

This filtering was done as a qc step. It sounds like there is a desire 
to have an option to also count singletons so we can look to implement that.

I agree with Nico that it would be useful for a subset of us to discuss 
approaches offline and come up with some common implementation options.  
Regarding the junction reads, both Michael and Herve have done work 
towards this with  findSpliceOverlaps() and encodeOverlaps() in 
GenomicRanges. These are in early stages and don't currently produce 
just junction-level counts but the idea of a wrapper seems very reasonable.

Valerie

On 09/11/2012 02:48 AM, Mark Robinson wrote:
> Hi all,
>
> Sorry for cross-posting, but I figured this might be of interest to both developers and users.
>
> My student (Xiaobei Zhou) and I have been playing around with the "simple" task of counting RNA-seq reads in Bioconductor and we've collected a few observations.  My apologies in advance if I've made errors in calling the software or if I've omitted other methods to try.  Please correct us if so.
>
> Here is a (hopefully) concise summary of what we've found:
>
> 1. There are now various options in Bioconductor to go from (mapped) RNA-seq reads to counts at the gene, transcript, exon level.  The main players, as we understand it, are: easyRNASeq, summarizeOverlaps() in GenomicRanges and featureCounts() in Rsubread.  Outside of Bioconductor, we have used htseq-count from Simon Anders and treat this as the standard, since it's fairly mature.  For reference, we've collected code segments for all methods below, in case anyone wants to repeat this, or needs this for their data (of course, as a resource, the mailing list is not the best place to capture this).
>
> 2. For single-end reads, the methods are quite similar, although differences exist.  I attribute this to differences in how reads are matched to features (e.g. overhanging reads) and/or filtering.  Nailing down the exact details are a bit hard to figure out, but I expect these to be minor overall.  Here is an example: http://imlspenticton.uzh.ch/se.png
>
> 3. For paired-end reads, there is (or at least can be) a distinct divergence in read counts: http://imlspenticton.uzh.ch/pe.png ... at present, easyRNASeq and featureCounts()/Rsubread "overcount" reads and summarizeOverlap() "undercounts", relative to htseq-count.
>
> Wei Shi (Rsubread author) has recently mentioned that this is the way he likes to count:
> https://stat.ethz.ch/pipermail/bioconductor/2012-June/046440.html
>
> I've spoken with Nico Delhomme (easyRNASeq author) offline and accurately treating mate reads is on his TODO list.
>
> The undercounting in summarizeOverlaps(), as far as we can tell, is due to filtering singleton reads (one read of the mate is mapped, the other is not) and not counting them.
>
> 4. I have some results on memory usage and CPU time.  The packages available have different strategies to manage the tradeoff, so they vary widely.
>
>
> A couple comments / suggestions:
>
> 1. From a user perspective, it would be optimal to have all options available: count singletons or not, overhang or no overhang, counting a pair as a single fragment, etc.
>
> 2. Another level of flexibility that would be useful is counting on junctions.  I think the infrastructure for this is now available:
> https://stat.ethz.ch/pipermail/bioconductor/2012-May/045506.html
> Is there anyone working on a wrapper that just takes the reads (or pointer to BAM) and features and adds junction-level counts to the exon-level ones?
>
>
> Please discuss.
>
> These are observations, not criticisms.  I sincerely thank the contributors for their excellent packages; it's very nice to have multiple options available.
>
>
> Best regards,
> Mark
>
>
>
>
> # Do not expect this to run.  Populate a TXT file
> # listing all the filenames (BAM/SAM)
>
> # Given: metadata with file names
> # starting from GTF files and BAM/SAM, derive tables of counts
> sample.inform<- read.table("samples_bam_sam_etc.txt", header = TRUE, stringsAsFactors = FALSE)
>
> ## -------------------
> ## easyRNASeq
> ## -------------------
>
> library(easyRNASeq)
> library(BSgenome.Dmelanogaster.UCSC.dm3)
>
> gtf<- "Drosophila_melanogaster.BDGP5.67.gtf"
> bamFlsR<- sample.inform[, "bamfile_resort"]
>
> c_ers<- easyRNASeq(organism = "Dmelanogaster",
>                      chr.sizes = as.list(seqlengths(Dmelanogaster)),
>                      annotationMethod = "gtf", annotationFile= gtf,
>                      format = "bam",gapped = TRUE, count = "genes",
>                      summarization = "geneModels",
>                      filenames = bamFlsR, filesDirectory = dir)
>
>
> ## -------------------
> ## countOverlaps
> ## -------------------
>
> library(GenomicFeatures)
> library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
>
> txdb<- TxDb.Dmelanogaster.UCSC.dm3.ensGene
> gnModel<- exonsBy(txdb, "gene")
> bamFls<- paste(sample.inform[, "bamfile_s"], sep = "")
>
>
>   UCSC2FlybaseGRanges<- function (GRanges) {
> ### Rename the chromosomes in<GRanges>   from UCSC conventions ('chr1',
> ### etc) to comport with Flybase conventions ('1', etc) by stripping
> ### leading 'chr' and translating 'M' as 'dmel_mitochondrion_genome'.
> seqlevels(GRanges)<- sub('chr','',seqlevels(GRanges))
> seqlevels(GRanges)<- sub('M','dmel_mitochondrion_genome',seqlevels(GRanges))
> GRanges
> }
>
> gnModelT<- UCSC2FlybaseGRanges(gnModel)
>
> counter<- function(fl, gnModel)
> {
>      aln<- readGappedAlignments(fl)
>      strand(aln)<- "*" # for strand-blind sample prep protocol
>      hits<- countOverlaps(aln, gnModel)
>      counts<- countOverlaps(gnModel, aln[hits==1])
>      names(counts)<- names(gnModel)
>      counts
> }
> c_conOver<- sapply(bamFls, counter, gnModel = gnModelT)
>
> ## -------------------
> ## summarizeOverlaps using UCSC annotation
> ## -------------------
>
> library(GenomicRanges)
> library(Rsamtools)
> library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
> library(GenomicFeatures)
> bamFlsR<- sample.inform[, "bamfile_resort"]
> # separate paired and unpaired
> bamFlsR_pe<- bamFlsR[sample.inform[, "libtype"] == "PE"]
> bamFlsR_se<- bamFlsR[sample.inform[, "libtype"] == "SE"]
> bamFlsR_peL<- BamFileList(bamFlsR_pe)
> bamFlsR_seL<- BamFileList(bamFlsR_se)
> txdb<- TxDb.Dmelanogaster.UCSC.dm3.ensGene
> gnModel<- exonsBy(txdb, "gene")
>
>
>   UCSC2FlybaseGRanges<- function (GRanges) {
> ### Rename the chromosomes in<GRanges>   from UCSC conventions ('chr1',
> ### etc) to comport with Flybase conventions ('1', etc) by stripping
> ### leading 'chr' and translating 'M' as 'dmel_mitochondrion_genome'.
> seqlevels(GRanges)<- sub('chr','',seqlevels(GRanges))
> seqlevels(GRanges)<- sub('M','dmel_mitochondrion_genome',seqlevels(GRanges))
> GRanges
> }
>
> gnModelT<- UCSC2FlybaseGRanges(gnModel)
> con_pe<- summarizeOverlaps(gnModelT, bamFlsR_peL,
>                              mode = "Union", singleEnd=FALSE, ignore.strand = TRUE)
> con_se<- summarizeOverlaps(gnModelT, bamFlsR_seL,
>                              mode = "Union", singleEnd=TRUE, ignore.strand = TRUE)
> c_summOver<- cbind(assays(con_pe)$counts, assays(con_se)$counts)
> colnames(count_summOver)<- c(basename(bamFlsR_pe), basename(bamFlsR_se))
>
>
> ## -------------------
> ## Rsubread
> ## -------------------
>
> library(Rsubread)
> samFls<- sample.inform[, "samfile"]
>
> gtf<- "Drosophila_melanogaster.BDGP5.67.gtf"
> z<- read.table(gtf, sep="\t", stringsAsFactors=FALSE)
> z<- z[z$V3=="exon",]
> ids<- gsub(" gene_id ","",sapply(strsplit(z$V9,";"),.subset,1))
> anno<- data.frame(entrezid=as.integer(factor(ids)),
>                     chromosome=z$V1, chr_start=z$V4, chr_stop=z$V5)
>
> ID<- sort(unique(anno$entrezid))
> genID<- ids[match(ID, anno$entrezid)]
>
> c_rsub<- featureCounts(SAMfiles=samFls, annot=anno, type = "gene")$counts
> rownames(count_rsub)<-genID
> colnames(count_rsub)<-basename(samFls)
>
>
>
>
>
>
>
>
>
>
> ----------
> Prof. Dr. Mark Robinson
> Bioinformatics
> Institute of Molecular Life Sciences
> University of Zurich
> Winterthurerstrasse 190
> 8057 Zurich
> Switzerland
>
> v: +41 44 635 4848
> f: +41 44 635 6898
> e: mark.robinson at imls.uzh.ch
> o: Y11-J-16
> w: http://tiny.cc/mrobin
>
> ----------
> http://www.fgcz.ch/Bioconductor2012
> http://www.eccb12.org/t5
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor