[BioC] GTF file error when using easyRNAseq

Nicolas Delhomme delhomme at embl.de
Tue Nov 12 14:21:47 CET 2013


Hej Natalia!

This is not the first time that I’ve seen this error on the list, but I’ve not been able to reproduce it so far with my own data. Would you mind sharing some data offline? Just an excerpt of your files would do. If that’s OK, I can create a folder on my box account and give you access to it.

I had already relaxed the constraint on parsing a gtf file in a previous update but forgot to reflect the changes in the error message. Only the gene_id and transcript_id fields are actually mandatory. I would not expect any issue with the EnsEMBL gtf file, but I’ll have a look at why it fails for the Gallus gallus one and let you know asap.
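
A rough way to check a gtf file for the two mandatory attributes is sketched here. It is plain base R and not the easyRNASeq parser; the helper name check_gtf_attributes and the example records are made up for illustration (the first record reuses the entry quoted later in this thread), and it assumes the standard layout of nine tab-separated columns with the attributes in the ninth.

```r
# Hypothetical helper (not part of easyRNASeq): for every record, report the
# number of tab-separated columns and whether each required attribute occurs.
check_gtf_attributes <- function(lines,
                                 required = c("gene_id", "transcript_id")) {
  # Ignore empty lines and comment lines
  lines <- lines[nzchar(lines) & !grepl("^#", lines)]
  fields <- strsplit(lines, "\t", fixed = TRUE)
  report <- data.frame(n_columns = vapply(fields, length, integer(1)))
  # The attributes live in the 9th column; use "" when that column is absent
  attrs <- vapply(fields,
                  function(f) if (length(f) >= 9L) f[9L] else "",
                  character(1))
  for (key in required) {
    # Match e.g. gene_id "..." at the start or after a semicolon
    pattern <- paste0("(^|;)\\s*", key, "\\s+\"")
    report[[key]] <- grepl(pattern, attrs, perl = TRUE)
  }
  report
}

# Illustrative records only: one complete record, then the same record
# broken across two lines so that neither half has nine columns.
example <- c(
  paste("chr1", "protein_coding", "exon", "19962541", "19963992", ".", "+", ".",
        'gene_id "ENSGALG00000000003"; transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2";',
        sep = "\t"),
  paste("chr1", "protein_coding", "exon", "19962541", "19963992", sep = "\t"),
  paste(".", "+", ".",
        'gene_id "ENSGALG00000000003"; transcript_id "ENSGALT00000000003";',
        sep = "\t")
)
report <- check_gtf_attributes(example)
print(report)
```

On a real file the input would be something like readLines("Gallus_gallus.Galgal4.73.gtf"). Rows where n_columns is not 9, or where either attribute column is FALSE, are the records worth inspecting.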

This has nothing to do with this error, but looking at your command line, I saw that you provide a character vector to the chrSizes argument. This is not necessary, as this information is extracted from the bam file in your case, so you can just drop chrSizes = "chrSizes" from your command line. I’ve now added an extra check in the method to detect this. Thanks :-)

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On 11 Nov 2013, at 16:31, Natalia [guest] <guest at bioconductor.org> wrote:

> 
> Dear all,
> I would like to make a count table to use in DESeq. I’ve tried to use easyRNAseq but I have a problem with the annotation file. I’ve downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I ran into the error "Error in .doBasicCount(obj) : The genomicAnnotation slot is empty", I modified the file and added chr before the chromosome number. The next problem was this:
> 
> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the required fields: gene_id, transcript_id, exon_number, gene_name.
> 
> To solve this problem:
> - I deleted all the entries without gene_name (like the first of the two examples below):
> 
> gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; exon_number "1"; gene_biotype "protein_coding"; exon_id "ENSGALE00000301221";
> 
> gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914"; exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding"; transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891";
> 
> - I checked the chromosome numbers and deleted the entries that didn’t match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I can’t find any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf file; I don’t know if that is a problem).
> 
> - I searched for semicolons and single quotes ‘ in the gene names, but I didn’t find any in the final file.
> 
> - I deleted all the columns after gene_name.
> 
> So finally the annotation file entries look like this:
> chr1	protein_coding	exon	19962541	19963992
> 	.	+	.	gene_id "ENSGALG00000000003"; transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2";
> 
> Nothing works; the error message is always the same. So, I don’t know what else I can do. Could you please help me?
> Thank you in advance!
> Cheers
> 
> Natalia
> 
> 
> here is my code:
>> count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus", chrSizes="chrSizes", annotationMethod="gtf", annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes", summarization="geneModels", format="bam", gapped=TRUE, filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq", conditions=conditions)
> Checking arguments... 
> Fetching annotations... 
> Read 334620 records
> Error in .getGtfRange(organismName(obj), filename = filename, ignoreWarnings = ignoreWarnings,  : 
> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the required fields: gene_id, transcript_id, exon_number, gene_name.
> In addition: Warning messages:
> 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes",  :
> Your organism has no mapping defined to perform the validity check for the UCSC compliance of the chromosome name.
> Defined organism's mapping can be listed using the 'knownOrganisms' function.
> To benefit from the validity check, you can provide a 'chr.map' to your 'easyRNASeq' function call.
> As you did not do so, 'validity.check' is turned off
> 2: In .Method(..., deparse.level = deparse.level) :
> number of columns of result is not a multiple of vector length (arg 1)
> 
>> traceback()
> 6: stop(paste("Your gtf file: ", filename, " does not contain all the required fields: ", 
>     paste(fields, collapse = ", "), ".", sep = ""))
> 5: .getGtfRange(organismName(obj), filename = filename, ignoreWarnings = ignoreWarnings, 
>     ...)
> 4: fetchAnnotation(obj, method = annotationMethod, filename = annotationFile, 
>     ignoreWarnings = ignoreWarnings, ...)
> 3: fetchAnnotation(obj, method = annotationMethod, filename = annotationFile, 
>     ignoreWarnings = ignoreWarnings, ...)
> 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", 
>     annotationMethod = "gtf", annotationFile = " Gallus_gallus.Galgal4.73.gtf ", 
>     count = "genes", summarization = "geneModels", format = "bam", 
>     gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), 
>     outputFormat = "DESeq", conditions = conditions)
> 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = "chrSizes", 
>     annotationMethod = "gtf", annotationFile = " Gallus_gallus.Galgal4.73.gtf ", 
>     count = "genes", summarization = "geneModels", format = "bam", 
>     gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), 
>     outputFormat = "DESeq", conditions = conditions)
> 
> 
> -- output of sessionInfo(): 
> 
>> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> 
> locale:
> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252    LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                   LC_TIME=Spanish_Spain.1252    
> 
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
> 
> other attached packages:
> [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0                      easyRNASeq_1.8.1                     ShortRead_1.20.0                    
> [5] Rsamtools_1.14.1                     GenomicRanges_1.14.3                 DESeq_1.14.0                         lattice_0.20-23                     
> [9] locfit_1.5-9.1                       Biostrings_2.30.0                    XVector_0.2.0                        IRanges_1.20.4                      
> [13] edgeR_3.4.0                          limma_3.18.2                         biomaRt_2.18.0                       Biobase_2.22.0                      
> [17] genomeIntervals_1.18.0               BiocGenerics_0.8.0                   intervals_0.14.0                     BiocInstaller_1.12.0                
> 
> loaded via a namespace (and not attached):
> [1] annotate_1.40.0      AnnotationDbi_1.24.0 bitops_1.0-6         DBI_0.2-7            genefilter_1.44.0    geneplotter_1.40.0   grid_3.0.2          
> [8] hwriter_1.3          latticeExtra_0.6-26  LSD_2.5              RColorBrewer_1.0-5   RCurl_1.95-4.1       RSQLite_0.11.4       splines_3.0.2       
> [15] stats4_3.0.2         survival_2.37-4      tools_3.0.2          XML_3.98-1.1         xtable_1.7-1         zlibbioc_1.8.0  
> 
> 
> --
> Sent via the guest posting facility at bioconductor.org.


