[BioC] Rsubread featureCounts unexpected strand-specifc counts with Ensembl GTF

Wed Aug 27 16:53:58 CEST 2014

Hi,

I am trying to use featureCounts to perform strand-specifc RNA-Seq read counting with an Ensembl GTF file. However, it seems to count reads from each strand separately rather than for each gene from the strand where the gene is located. I have created a small BAM that contains reads from only two genes (one from each strand) to illustrate the problem.

First, if I specify strandSpefic = 0 I get 97.7% of the reads aligned to two genes:
a = featureCounts("merged.bam", annot.ext = "Homo_sapiens.GRCh38.76.gtf",
strandSpecific = 0, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE)

> a$stat
                      Status counts.merged.bam
1                   Assigned             16179
2       Unassigned_Ambiguity                 0
3    Unassigned_MultiMapping                56
4      Unassigned_NoFeatures               335

And the reads counts for the two genes are:
> a$counts[c("ENSG00000066336","ENSG00000136869"),]
ENSG00000066336 ENSG00000136869 
          11759            4419 

However, when I try to specifiy strandSpecific = 1 or 2 I get reads from only one strand counted in either case
b = featureCounts("counts/merged.bam", annot.ext = "../../annotations/GRCh38/genes/Homo_sapiens.GRCh38.76.gtf", strandSpecific = 1, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE)

> b$stat
                      Status counts.merged.bam
1                   Assigned              5295
2       Unassigned_Ambiguity                 0
3    Unassigned_MultiMapping                56
4      Unassigned_NoFeatures             11219

> b$counts[c("ENSG00000066336","ENSG00000136869"),]
ENSG00000066336 ENSG00000136869 
            952            4343 

ENSG00000136869 is located on the forward strand, gets high counts with strandSpecific = 1.

#####

c = featureCounts("counts/merged.bam", annot.ext = "../../annotations/GRCh38/genes/Homo_sapiens.GRCh38.76.gtf", strandSpecific = 2, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE)

> c$stat
                      Status counts.merged.bam
1                   Assigned             10884
2       Unassigned_Ambiguity                 0
3    Unassigned_MultiMapping                56
4      Unassigned_NoFeatures              5630

> c$counts[c("ENSG00000066336","ENSG00000136869"),]
ENSG00000066336 ENSG00000136869 
          10807              76 

ENSG00000066336 is located on the reverse strand, gets high counts with strandSpecific = 2.

Furthermore, 
a$counts[c("ENSG00000066336","ENSG00000136869"),] = c$counts[c("ENSG00000066336","ENSG00000136869"),] +  b$counts[c("ENSG00000066336","ENSG00000136869"),]

I downloaded the Ensembl 76 GTF file from here:
ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens

The sample BAM files is available from here:
https://dl.dropboxusercontent.com/u/6006832/merged.bam

Is this a bug of featureCounts or am I doing something obviously wrong?

Kind regards,
Kaur Alasoo
PhD student at Sanger Institute

 -- output of sessionInfo(): 

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.12.6      BiocInstaller_1.12.1

loaded via a namespace (and not attached):
[1] tools_3.0.2

--
Sent via the guest posting facility at bioconductor.org.