[BioC] easyRNASeq package: error when summarizing per gene

Tue Sep 18 11:08:39 CEST 2012

Dear Nico,

thank you so much for your rapid reply!

I found the problem:  UCSC aligns some transcripts twice to the genome, when it is not clear where they actually come from. These transcripts have coordinates on two different chromosomes in the refGene table, and I guess that's why the gene models cannot be computed correctly. In the mm9 refGene table, 55 transcripts are affected by this. When I remove them from my annotation object, summarization= "geneModels" works fine!

Now I see that getting an adequate annotation IS the most critical step, as you wrote in your vignette.

Cheers,

Johanna

Johanna Schott, PhD candidate
AG Stoecklin
Posttranscriptional control of gene expression
DKFZ-A200
INF280
69120 Heidelberg
Tel.: 06221/54-6855
________________________________________
From: Nicolas Delhomme [delhomme at embl.de]
Sent: Monday, September 17, 2012 11:54 AM
To: Johanna Schott [guest]
Cc: bioconductor at r-project.org; Schott, Johanna
Subject: Re: easyRNASeq package: error when summarizing per gene

Dear Johanna,

The error is probably indeed due to your annotation file in combination with you chromosome size list. Every gene loci should be fully contained in the chromosomes, i.e. their coordinates should not be larger than the chromosome size you provide for the chromosome they're located on. I believe I'm checking this in the code, but you might have just found a configuration of providing the annotation and chromosome sizes arguments that I overlooked. The first use case in the "Use Cases" section of the vignette explains why the chromosome sizes are so important to the process. The development version of easyRNASeq (available around October 2nd (PDT time)), makes this a lot easier.

You can easily check if this is the case by running the following command:

lapply(names(seqlengths(Mmusculus)),function(chr,rngs,sizes){
        cat("processing chromosome",chr,"\n")
        sel <- space(exon_range) == chr
        stopifnot(all(start(exon_range[sel,]) <= sizes[chr]))
        stopifnot(all(end(exon_range[sel,]) <= sizes[chr]))
},exon_range,seqlengths(Mmusculus))

It should stop if any of your gene coordinate is larger that the chromosome size it's on. Note that I've just written this bit of code up in this email, so there might be typos. If there are and you can't figure them out, just let me know.

If there are no typos and it stops, then you would have to modify either your exon_range object or more easily your chromosome sizes one.

For this you could simply use the scanBamHeader (Rsamtools package) function to extract the chromosome sizes from your bam file as follow (change the first list to match your settings):

filesList <- dir("[your bam directory]",pattern="*\\.bam$",full.name=TRUE)

 ## read the headers
                headers <- scanBamHeader(filesList)

                ## Two sanity checks
                if(!all(sapply(headers,
                               function(header,expected){
                                 identical(sort(names(header$targets)),expected)
                               },sort(names(headers[[1]]$targets))))){
                  stop("Not all BAM files use the same chromosome names.")
                }

                chr.sizes <- headers[[1]]$targets
                if(!all(sapply(headers, function(header,chr.sizes){
                  identical(header$targets[match(names(chr.sizes),
                                                 names(header$targets))],chr.sizes)
                },chr.sizes))){
                  stop("The chromosome lengths differ between BAM files.")
                }

                ## check if we got some chr sizes at all
                if(length(chr.sizes)==0){
                  stop("No chromosome sizes could be determined from your BAM file(s).Is the BAM header present?\nIf not, you can use the 'chr.sizes' argument to provided the chromosome sizes information.")
                }

That's actually an excerpt of the easyRNASeq development version code to extract the chromosome size from the BAM files.

I think that this is the reason for the error. If not, we can take this discussion off the list to figure it out, provided you can give me access to your annotation object and to an excerpt of your data (that you could upload to my dropbox). This error should actually not occur in the development version anymore, but it would be great to give it a try after October 3rd again.

Let me know how it goes,

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On Sep 16, 2012, at 2:21 PM, Johanna Schott [guest] wrote:

>
> Dear list,
>
> while analyzing RNASeq data with the easyRNASeq package I get an error message concerning .Call2("Rle_getStartEndRunAndOffset", x, start, end, PACKAGE = "IRanges") :
>  'x' values larger than vector length 'sum(width)'
>
> Here is my code:
>
>
>> library("easyRNASeq")
>
>> library(BSgenome.Mmusculus.UCSC.mm9)
>
>> annot <- load("gAnnot.rda")
>
>> count_table <- easyRNASeq(getwd(), filenames = "Sample1.bam",
> + readLength = 52L,
> + organism = "Mmusculus",
> + chr.sizes <- seqlengths(Mmusculus),
> + format = "bam",
> + annotationMethod = "env",
> + annotationObject = exon_range,
> + count = "genes",
> + summarization = "geneModels"
> + )
>
> Here is the error message:
>
> Fehler in unlist(aggregate(readCoverage(obj)[names(geneModel(obj))[gm.sel]],  :
>  Fehler bei der Auswertung des Argumentes 'x' bei der Methodenauswahl
> für Funktion 'unlist': Fehler in .Call2("Rle_getStartEndRunAndOffset", x, start, end, PACKAGE = "IRanges") :
>  'x' values larger than vector length 'sum(width)'
>
> This only happens when I am trying to summarize reads according to genes. For count = "transcripts", everything works fine. Does the problem come from my annotation file?
> For the moment, I am only interested in getting annotation to the UCSC RefSeq Track, table refGene. For this purpose, I had to make the annotation object myself, since I could not find any other way to
> get an annotation object with gene names and not only transcript names from UCSC. I did this by downloading the table for all exons in the refGene table as custom track and taking the gene names for the individual transcripts from the name2 column
> of the refGene table. My annotation object looks like this:
>
>> exon_range
> RangedData with 285524 rows and 4 value columns across 32 spaces
>             space                 ranges   |      strand   transcript          gene                                         exon
>          <factor>              <IRanges>   | <character>  <character>   <character>                                  <character>
> 1             chr1 [176160756, 176160919]   |           +    NM_011465         Spna1         NM_011465_exon_40_0_chr1_176160757_f
> 2             chr1 [164626408, 164626494]   |           - NM_001081290        Prrc2c      NM_001081290_exon_15_0_chr1_164626409_r
> 3             chr1 [ 16121374,  16122631]   |           +    NM_133832         Rdh10           NM_133832_exon_5_0_chr1_16121375_f
> 4             chr1 [ 21495764,  21495940]   |           - NM_001160139         Kcnq5       NM_001160139_exon_11_0_chr1_21495765_r
> 5             chr1 [ 21495764,  21495940]   |           -    NM_023872         Kcnq5          NM_023872_exon_10_0_chr1_21495765_r
> 6             chr1 [ 23855102,  23855410]   |           -    NM_028534         Smap1           NM_028534_exon_1_0_chr1_23855103_r
> 7             chr1 [ 26738645,  26742756]   |           - NM_001033764 4931408C20Rik        NM_001033764_exon_0_0_chr1_26738646_r
> 8             chr1 [ 36568720,  36569998]   |           + NM_001039551         Cnnm3        NM_001039551_exon_0_0_chr1_36568721_f
> 9             chr1 [ 36568720,  36569998]   |           +    NM_053186         Cnnm3           NM_053186_exon_0_0_chr1_36568721_f
> ...            ...                    ... ...         ...          ...           ...                                          ...
> 285516 chrY_random   [52089317, 52089373]   |           - NM_001037748     LOC380994 NM_001037748_exon_7_0_chrY_random_52089318_r
> 285517 chrY_random   [52515005, 52515028]   |           + NM_001025241     LOC434960 NM_001025241_exon_0_0_chrY_random_52515006_f
> 285518 chrY_random   [52516256, 52517353]   |           + NM_001025241     LOC434960 NM_001025241_exon_1_0_chrY_random_52516257_f
> 285519 chrY_random   [52590932, 52591790]   |           + NM_001160141  LOC100041223 NM_001160141_exon_0_0_chrY_random_52590933_f
> 285520 chrY_random   [52881631, 52882623]   |           + NM_001160137  LOC100039614 NM_001160137_exon_0_0_chrY_random_52881632_f
> 285521 chrY_random   [53819454, 53820453]   |           + NM_001160135  LOC100039574 NM_001160135_exon_0_0_chrY_random_53819455_f
> 285522 chrY_random   [54420148, 54420272]   |           + NM_001017394  LOC100039753 NM_001017394_exon_0_0_chrY_random_54420149_f
> 285523 chrY_random   [54421397, 54423069]   |           + NM_001017394  LOC100039753 NM_001017394_exon_1_0_chrY_random_54421398_f
> 285524 chrY_random   [58501954, 58502946]   |           + NM_001160137  LOC100039614 NM_001160137_exon_0_0_chrY_random_58501955_f
>
>
> Does someone know what this error means, and perhaps what I would have to change in my annotation object to avoid it?
>
> Thank you very much in advance,
>
> Johanna
>
> -- output of sessionInfo():
>
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-pc-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                    LC_TIME=German_Germany.1252
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.17 easyRNASeq_1.2.5                   ShortRead_1.14.4                   latticeExtra_0.6-24
> [5] RColorBrewer_1.0-5                 lattice_0.20-6                     Rsamtools_1.8.6                    DESeq_1.8.3
> [9] locfit_1.5-8                       BSgenome_1.24.0                    GenomicRanges_1.8.13               Biostrings_2.24.1
> [13] IRanges_1.14.4                     edgeR_2.6.12                       limma_3.12.3                       biomaRt_2.12.0
> [17] Biobase_2.16.0                     genomeIntervals_1.12.0             BiocGenerics_0.2.0                 intervals_0.13.3
>
> loaded via a namespace (and not attached):
> [1] annotate_1.34.1      AnnotationDbi_1.18.3 bitops_1.0-4.1       DBI_0.2-5            genefilter_1.38.0    geneplotter_1.34.0   grid_2.15.1
> [8] hwriter_1.3          RCurl_1.91-1.1       RSQLite_0.11.1       splines_2.15.1       stats4_2.15.1        survival_2.36-14     XML_3.9-4.1
> [15] xtable_1.7-0         zlibbioc_1.2.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.