[BioC] counting RNA-seq reads in R/BioC

Tue Sep 11 11:48:39 CEST 2012

Hi all,

Sorry for cross-posting, but I figured this might be of interest to both developers and users.

My student (Xiaobei Zhou) and I have been playing around with the "simple" task of counting RNA-seq reads in Bioconductor and we've collected a few observations.  My apologies in advance if I've made errors in calling the software or if I've omitted other methods to try.  Please correct us if so.

Here is a (hopefully) concise summary of what we've found:

1. There are now various options in Bioconductor to go from (mapped) RNA-seq reads to counts at the gene, transcript, exon level.  The main players, as we understand it, are: easyRNASeq, summarizeOverlaps() in GenomicRanges and featureCounts() in Rsubread.  Outside of Bioconductor, we have used htseq-count from Simon Anders and treat this as the standard, since it's fairly mature.  For reference, we've collected code segments for all methods below, in case anyone wants to repeat this, or needs this for their data (of course, as a resource, the mailing list is not the best place to capture this).

2. For single-end reads, the methods are quite similar, although differences exist.  I attribute this to differences in how reads are matched to features (e.g. overhanging reads) and/or filtering.  Nailing down the exact details are a bit hard to figure out, but I expect these to be minor overall.  Here is an example: http://imlspenticton.uzh.ch/se.png 

3. For paired-end reads, there is (or at least can be) a distinct divergence in read counts: http://imlspenticton.uzh.ch/pe.png ... at present, easyRNASeq and featureCounts()/Rsubread "overcount" reads and summarizeOverlap() "undercounts", relative to htseq-count.

Wei Shi (Rsubread author) has recently mentioned that this is the way he likes to count:
https://stat.ethz.ch/pipermail/bioconductor/2012-June/046440.html

I've spoken with Nico Delhomme (easyRNASeq author) offline and accurately treating mate reads is on his TODO list.

The undercounting in summarizeOverlaps(), as far as we can tell, is due to filtering singleton reads (one read of the mate is mapped, the other is not) and not counting them.

4. I have some results on memory usage and CPU time.  The packages available have different strategies to manage the tradeoff, so they vary widely.

A couple comments / suggestions:

1. From a user perspective, it would be optimal to have all options available: count singletons or not, overhang or no overhang, counting a pair as a single fragment, etc.

2. Another level of flexibility that would be useful is counting on junctions.  I think the infrastructure for this is now available: 
https://stat.ethz.ch/pipermail/bioconductor/2012-May/045506.html
Is there anyone working on a wrapper that just takes the reads (or pointer to BAM) and features and adds junction-level counts to the exon-level ones?

Please discuss.

These are observations, not criticisms.  I sincerely thank the contributors for their excellent packages; it's very nice to have multiple options available.

Best regards,
Mark

# Do not expect this to run.  Populate a TXT file 
# listing all the filenames (BAM/SAM)

# Given: metadata with file names
# starting from GTF files and BAM/SAM, derive tables of counts
sample.inform <- read.table("samples_bam_sam_etc.txt", header = TRUE, stringsAsFactors = FALSE)

## -------------------
## easyRNASeq
## -------------------

library(easyRNASeq)
library(BSgenome.Dmelanogaster.UCSC.dm3)

gtf <- "Drosophila_melanogaster.BDGP5.67.gtf"
bamFlsR <- sample.inform[, "bamfile_resort"]

c_ers <- easyRNASeq(organism = "Dmelanogaster", 
                    chr.sizes = as.list(seqlengths(Dmelanogaster)),
                    annotationMethod = "gtf", annotationFile= gtf, 
                    format = "bam",gapped = TRUE, count = "genes", 
                    summarization = "geneModels", 
                    filenames = bamFlsR, filesDirectory = dir)

## -------------------
## countOverlaps
## -------------------

library(GenomicFeatures)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)

txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
gnModel <- exonsBy(txdb, "gene")
bamFls <- paste(sample.inform[, "bamfile_s"], sep = "")

 UCSC2FlybaseGRanges<- function (GRanges) {
### Rename the chromosomes in<GRanges>  from UCSC conventions ('chr1',
### etc) to comport with Flybase conventions ('1', etc) by stripping
### leading 'chr' and translating 'M' as 'dmel_mitochondrion_genome'.
seqlevels(GRanges)<- sub('chr','',seqlevels(GRanges))
seqlevels(GRanges)<- sub('M','dmel_mitochondrion_genome',seqlevels(GRanges))
GRanges
}

gnModelT <- UCSC2FlybaseGRanges(gnModel)

counter <- function(fl, gnModel)
{
    aln <- readGappedAlignments(fl)
    strand(aln) <- "*" # for strand-blind sample prep protocol
    hits <- countOverlaps(aln, gnModel)
    counts <- countOverlaps(gnModel, aln[hits==1])
    names(counts) <- names(gnModel)
    counts
}
c_conOver <- sapply(bamFls, counter, gnModel = gnModelT)

## -------------------
## summarizeOverlaps using UCSC annotation
## -------------------

library(GenomicRanges)
library(Rsamtools)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
library(GenomicFeatures)
bamFlsR <- sample.inform[, "bamfile_resort"]
# separate paired and unpaired
bamFlsR_pe <- bamFlsR[sample.inform[, "libtype"] == "PE"]
bamFlsR_se <- bamFlsR[sample.inform[, "libtype"] == "SE"]
bamFlsR_peL <- BamFileList(bamFlsR_pe)
bamFlsR_seL <- BamFileList(bamFlsR_se)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
gnModel <- exonsBy(txdb, "gene")

 UCSC2FlybaseGRanges<- function (GRanges) {
### Rename the chromosomes in<GRanges>  from UCSC conventions ('chr1',
### etc) to comport with Flybase conventions ('1', etc) by stripping
### leading 'chr' and translating 'M' as 'dmel_mitochondrion_genome'.
seqlevels(GRanges)<- sub('chr','',seqlevels(GRanges))
seqlevels(GRanges)<- sub('M','dmel_mitochondrion_genome',seqlevels(GRanges))
GRanges
}

gnModelT <- UCSC2FlybaseGRanges(gnModel)
con_pe <- summarizeOverlaps(gnModelT, bamFlsR_peL, 
                            mode = "Union", singleEnd=FALSE, ignore.strand = TRUE)
con_se <- summarizeOverlaps(gnModelT, bamFlsR_seL, 
                            mode = "Union", singleEnd=TRUE, ignore.strand = TRUE)
c_summOver <- cbind(assays(con_pe)$counts, assays(con_se)$counts)
colnames(count_summOver) <- c(basename(bamFlsR_pe), basename(bamFlsR_se))

## -------------------
## Rsubread
## -------------------

library(Rsubread)
samFls <- sample.inform[, "samfile"]

gtf <- "Drosophila_melanogaster.BDGP5.67.gtf"
z <- read.table(gtf, sep="\t", stringsAsFactors=FALSE)
z <- z[z$V3=="exon",]
ids <- gsub(" gene_id ","",sapply(strsplit(z$V9,";"),.subset,1))
anno <- data.frame(entrezid=as.integer(factor(ids)), 
                   chromosome=z$V1, chr_start=z$V4, chr_stop=z$V5)

ID <- sort(unique(anno$entrezid))
genID <- ids[match(ID, anno$entrezid)]

c_rsub <- featureCounts(SAMfiles=samFls, annot=anno, type = "gene")$counts
rownames(count_rsub) <-genID
colnames(count_rsub) <-basename(samFls)

----------
Prof. Dr. Mark Robinson
Bioinformatics
Institute of Molecular Life Sciences
University of Zurich
Winterthurerstrasse 190
8057 Zurich
Switzerland

v: +41 44 635 4848
f: +41 44 635 6898
e: mark.robinson at imls.uzh.ch
o: Y11-J-16
w: http://tiny.cc/mrobin

----------
http://www.fgcz.ch/Bioconductor2012
http://www.eccb12.org/t5