[BioC] countOverlaps Warnings

Pankaj Agarwal p.agarwal at duke.edu
Thu Apr 24 04:40:22 CEST 2014


Valerie,

I used summarizeOverlaps() instead, since it seemed simpler to use.  I also used htseq-count to get the counts using the same parameters, but I get very different results.
>From what I have read and understood, both should give similar results since the counts should be reported at the "gene" level and not the "isoform" level.
I would feel much more comfortable with the results if they were both at least close.

For both I used the UCSC GTF file in the iGenomes distribution.
For htseq-count I used BAM files sorted by name since it required it, but for summarizeOverlaps I used BAM files sorted by coordinates (the default sorting done by samtools).
Example of counts generated by the two methods are given after the code:

The code for htseq-count that I used is as follows (once for each sample):

samtools view -h A.sorted_byname.bam | htseq-count -t exon -i gene_id -m intersection-nonempty -s no - iGenomes/UCSC/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > A.htseq.count

 The code for summarizeOverlaps() is as follows:

> library(GenomicFeatures) 
Loading required package: BiocGenerics 
Loading required package: parallel 
...
> library(Rsamtools) 
Loading required package: Biostrings 
> library(rtracklayer) 
> library(GenomicRanges) 
> 
> txdb <- makeTranscriptDbFromGFF(file=iGenomes/UCSC/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf", format="gtf") 
extracting transcript information 
Estimating transcript ranges. 
Extracting gene IDs 
Processing splicing information for gtf file. 
Deducing exon rank from relative coordinates provided 
Prepare the 'metadata' data frame ... metadata: OK 
Now generating chrominfo from available sequence names. No chromosome length information is available. 
Warning messages: 
1: In .deduceExonRankings(exs, format = "gtf") : 
  Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName 
2: In matchCircularity(chroms, circ_seqs) : 
  None of the strings in your circ_seqs argument match your seqnames. 
> 

- are these warnings message problems? 

> saveDb(txdb, file="./data/ucsc_igenomes_hg19_gtf.sqlite") 
>
> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") 
> 
> eByg <- exonsBy(txdb, by="gene") 
>
> head(eByg) 
GRangesList of length 6: 
$A1BG 
GRanges with 8 ranges and 2 metadata columns: 
      seqnames               ranges strand |   exon_id   exon_name 
         <Rle>            <IRanges>  <Rle> | <integer> <character> 
  [1]    chr19 [58858172, 58858395]      - |    163177        <NA> 
  [2]    chr19 [58858719, 58859006]      - |    163178        <NA> 
>
> samples = c("A.sorted", "B.sorted", "C.sorted", "D.sorted") 
>
> samples 
[1] "A.sorted" "B.sorted"         
[3] "C.sorted" "D.sorted"   
>
> samplespath 
[1] "../A.sorted.bam" 
[2] "../B.bam"         
[3] "../C.sorted.bam"       
[4] "../D.sorted.bam"       
> 
> names(samplespath) <- samples
> bfl <- BamFileList(samplespath, yieldSize=50000, index=character()) 

I am not sure what yieldSize means, just used it from the example I followed.

> bfl 
BamFileList of length 4 
names(4): A.sorted ... B.sorted 
> 
> countDF_intsectionNotEmpty <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE) 
>
> head(countDF_intsectionNotEmpty) 
class: SummarizedExperiment 
dim: 1 4 
exptData(0): 
assays(1): counts 
rownames(1): A1BG 
rowData metadata column names(0): 
colnames(4): A1.sorted B.sorted  C.sorted D.sorted 
colData names(0): 
> 
> class(countDF_intsectionNotEmpty) 
[1] "SummarizedExperiment" 
attr(,"package") 
[1] "GenomicRanges" 
> 
> countDF_intsectionNotEmpty <- assays(countDF_intsectionNotEmpty)$counts 
> colnames(countDF_intsectionNotEmpty) <- samples 
> countDF_intsectionNotEmpty[1:4,] 
         	A.sorted   B.sorted   C.sorted   D.sorted
A1BG            13         119 	        162	             136
A1BG-AS1   2         112 	        52	             85
A1CF             3         28 	       0	            5
A2M              2543  17593       47                233
...

The counts that I get are very different from what I get using htseq-count, which are following for the same genes listed above:

         	       A.bam  B. bam  C. bam  D. bam
A1BG            7            67 	      90	      72
A1BG-AS1  1            69 	     41	     51
A1CF             2          16 	    0	     3
A2M	      1536    10866     34            173
 ...
...

I would appreciate your feedback.

Thanks,

- Pankaj

-----Original Message-----
From: Valerie Obenchain [mailto:vobencha at fhcrc.org] 
Sent: Wednesday, April 23, 2014 12:50 PM
To: Pankaj Agarwal; bioconductor at r-project.org
Subject: Re: [BioC] countOverlaps Warnings

Pankaj,

How did it go? Were you able to get readGAlignmentPairs() or
readGAlignmentsList() to work for your case?

Valerie


On 04/21/2014 01:32 PM, Valerie Obenchain wrote:
> Hi Pankaj,
>
> There are several options for counting paired-end reads.
>
> Both readGAlignmentPairs() and readGAlignmentsList() are appropriate 
> for paired-end data and are in the GenomicAlignments package. They use 
> the same counting algorithm but return the data in different containers.
> readGAlignmentPairs() returns only pairs while readGAlignmentsList() 
> returns pairs as well as singletons, reads with unmapped pairs etc. 
> See the ?readGAlignments man page for details.
>
> Another counting function is summarizeOverlaps() which uses the above 
> functions 'under the hood'. It returns the counts in the 'assays' slot 
> of a SummarizedExperiment object. Since you have multiple bam files to 
> count you may want to use the BamFileList method.
>
> The man page has examples of counting paired-end bam files.
> library(Rsamtools)
> ?summarizeOverlaps
>
> Other packages that offer count functions are Rsubread and easyRNASeq.
>
> Valerie
>
>
>
>
> On 04/21/2014 09:53 AM, Pankaj Agarwal wrote:
>> Hi,
>>
>> I am analyzing a matched pair tumor/normal rna-seq data set.  After 
>> aligning with bowtie2 against UCSC hg19 index, I am trying to get the 
>> counts for each of the samples using the iGenomes UCSC GTF file.  I 
>> came across the following tutorial which shows different ways to do 
>> it in Bioconductor:
>> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_De
>> c_12_16_2013/Rrnaseq/Rrnaseq.pdf
>>
>>
>> Following along slide 28 "Read Counting with countOverlaps", I am 
>> able to generate the counts but get the following Warnings.  I just 
>> want to be sure that there is nothing to be worried about because I 
>> don't understand the meaning of these warnings.  My code is as follows:
>>
>>> library(GenomicFeatures)
>> Loading required package: AnnotationDbi
>>> library(Rsamtools)
>>> library(rtracklayer)
>>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite")
>>> eByg <- exonsBy(txdb, by="gene")
>>> head(eByg)
>> GRangesList of length 6:
>> $A1BG
>> GRanges with 8 ranges and 2 metadata columns:
>>        seqnames               ranges strand |   exon_id   exon_name
>>           <Rle>            <IRanges>  <Rle> | <integer> <character>
>>    [1]    chr19 [58858172, 58858395]      - |    163177        <NA>
>>    [2]    chr19 [58858719, 58859006]      - |    163178        <NA>
>>    [3]    chr19 [58861736, 58862017]      - |    163179        <NA>
>>    [4]    chr19 [58862757, 58863053]      - |    163180        <NA>
>>    [5]    chr19 [58863649, 58863921]      - |    163181        <NA>
>>    [6]    chr19 [58864294, 58864563]      - |    163182        <NA>
>>    [7]    chr19 [58864658, 58864693]      - |    163183        <NA>
>>    [8]    chr19 [58864770, 58864865]      - |    163184        <NA>
>>
>> $A1BG-AS1
>> GRanges with 4 ranges and 2 metadata columns:
>>        seqnames               ranges strand | exon_id exon_name
>>    [1]    chr19 [58863336, 58864410]      + |  156640      <NA>
>>    [2]    chr19 [58864745, 58864840]      + |  156641      <NA>
>>    [3]    chr19 [58865080, 58865223]      + |  156642      <NA>
>>    [4]    chr19 [58865735, 58866549]      + |  156643      <NA>
>>
>> ...
>> <4 more elements>
>> ---
>> seqlengths:
>>                   chr12                  chr8 ...  chr1_gl000192_random
>>                      NA                    NA ...                    NA
>>> samples = c("BRPC13-1118_L1.D710_501.sorted",
>>> "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted",
>>> "DU06_PBMC_RNA_L1.sorted")
>>>
>>> samples
>> [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted"
>> [3] "DU04_PBMC_RNA_L1.sorted"        "DU06_PBMC_RNA_L1.sorted"
>>>
>>> samplespath <- paste("./", samples, ".bam", sep="")
>>>
>>> samplespath
>> [1] "./BRPC13-1118_L1.D710_501.sorted.bam"
>> [2] "./BRPC_13-764_L2.sorted.bam"
>> [3] "./DU04_PBMC_RNA_L1.sorted.bam"
>> [4] "./DU06_PBMC_RNA_L1.sorted.bam"
>>>
>>> names(samplespath) <- samples
>>>
>>> samplespath
>>          BRPC13-1118_L1.D710_501.sorted BRPC_13-764_L2.sorted 
>> "./BRPC13-1118_L1.D710_501.sorted.bam"
>> "./BRPC_13-764_L2.sorted.bam"
>>                 DU04_PBMC_RNA_L1.sorted DU06_PBMC_RNA_L1.sorted
>>         "./DU04_PBMC_RNA_L1.sorted.bam"
>> "./DU06_PBMC_RNA_L1.sorted.bam"
>>>
>>> countDF <- data.frame(row.names=names(eByg))
>>>
>>> countDF
>> data frame with 0 columns and 23710 rows
>>>
>>> dim(countDF)
>> [1] 23710     0
>>>
>>> for(i in samplespath) {
>> + aligns <- readGAlignmentsFromBam(i) counts <- countOverlaps(eByg, 
>> + aligns, ignore.strand=TRUE) countDF <- cbind(countDF, counts) }
>> Warning messages:
>> 1: In .Seqinfo.mergexy(x, y) :
>>    Each of the 2 combined objects has sequence levels not in the other:
>>    - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, 
>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, 
>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, 
>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, 
>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, 
>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, 
>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, 
>> chr4_gl000194_random, chr1_gl000192_random
>>    - in 'y': chrM
>>    Make sure to always combine/compare objects based on the same 
>> reference
>>    genome (use suppressWarnings() to suppress this warning).
>> 2: In .Seqinfo.mergexy(x, y) :
>>    Each of the 2 combined objects has sequence levels not in the other:
>>    - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, 
>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, 
>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, 
>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, 
>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, 
>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, 
>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, 
>> chr4_gl000194_random, chr1_gl000192_random
>>    - in 'y': chrM
>>    Make sure to always combine/compare objects based on the same 
>> reference
>>    genome (use suppressWarnings() to suppress this warning).
>> 3: In .Seqinfo.mergexy(x, y) :
>>    Each of the 2 combined objects has sequence levels not in the other:
>>    - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, 
>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, 
>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, 
>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, 
>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, 
>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, 
>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, 
>> chr4_gl000194_random, chr1_gl000192_random
>>    - in 'y': chrM
>>    Make sure to always combine/compare objects based on the same 
>> reference
>>    genome (use suppressWarnings() to suppress this warning).
>> 4: In .Seqinfo.mergexy(x, y) :
>>    Each of the 2 combined objects has sequence levels not in the other:
>>    - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, 
>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, 
>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, 
>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, 
>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, 
>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, 
>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, 
>> chr4_gl000194_random, chr1_gl000192_random
>>    - in 'y': chrM
>>    Make sure to always combine/compare objects based on the same 
>> reference
>>    genome (use suppressWarnings() to suppress this warning).
>>>
>>
>> I also wanted to verify if what I am doing is correct for paired end 
>> reads.
>>
>> Thanks,
>>
>> - Pankaj
>>
>>
>> =======================================
>> sessionInfo()
>> R version 3.0.3 (2014-03-06)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>


--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fhcrc.org
Phone: (206) 667-3158



More information about the Bioconductor mailing list