[BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes

Thomas Girke thomas.girke at ucr.edu
Sat Jun 8 01:33:42 CEST 2013


The perl -> genbank -> gff -> R -> txdb approach seems to work in
R-2.15.1 but not in R-3.0.0 anymore. The latter returns 

txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact")
extracting transcript information
Error in .prepareGFF3TXS(data) : Unexpected transcript duplicates

Given that there are over 2500 bacteria genomes annotated this way at NCBI,
a more generic solution to this problem would be very useful, I think. 

Thomas


On Fri, Jun 07, 2013 at 07:56:32PM +0000, Cook, Malcolm wrote:
> FYI, bioperl includes bp_genbank2gff3.pl
> 
> which when run as
> 
> > bp_genbank2gff3.pl NC_011025.gbk
> 
> produces NC_011025.gbk.gff (attached)
> 
> which loaded without error with transcript:
> 
> > txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact")
> extracting transcript information
> Extracting gene IDs
> extracting transcript information
> Processing splicing information for gff3 file.
> Deducing exon rank from relative coordinates provided
> Prepare the 'metadata' data frame ... metadata: OK
> Now generating chrominfo from available sequence names. No chromosome length information is available.
> Warning messages:
> 1: In .deduceExonRankings(exs, format = "gff") :
>   Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
> 2: In matchCircularity(chroms, circ_seqs) :
>   None of the strings in your circ_seqs argument match your seqnames.
> > txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Supporting package: GenomicFeatures
> | Data source: NCBI
> | Genus and Species: Some bact
> | miRBase build ID: NA
> | transcript_nrow: 631
> | exon_nrow: 631
> | cds_nrow: 631
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2013-06-07 14:52:50 -0500 (Fri, 07 Jun 2013)
> | GenomicFeatures version at creation time: 1.10.2
> | RSQLite version at creation time: 0.11.2
> | DBSCHEMAVERSION: 1.0
> 
> 
> 
> 
> -----Original Message-----
> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Cook, Malcolm
> Sent: Friday, June 07, 2013 2:32 PM
> To: 'Thomas Girke'; 'bioconductor at r-project.org'
> Cc: 'Brandon Gallaher'
> Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
> 
> Taking a quick look at the GFF in question ... I don't see any mRNA features....  they appear to be implicit .... which is not well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml)
> 
> That is your first problem.
> 
> In particular, the first gene in a file by itself is rejected.  Adding the mRNA and exon lines as below, and the first gene is now accepted by makeTranscriptDbFromGFF.
> 
> ##gff-version 3
> #!gff-spec-version 1.20
> #!processor NCBI annotwriter
> ##sequence-region NC_011025.1 1 820453
> ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272
> NC_011025.1	RefSeq	region	1	820453	.	+	.	ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=158L3-1
> NC_011025.1	RefSeq	gene	107	1471	.	+	.	ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA;locus_tag=MARTH_orf001
> NC_011025.1	RefSeq	CDS	107	1471	.	+	0	ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_001999673.1,GeneID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4
> NC_011025.1	RefSeq	mRNA	107	1471	.	+	0	ID=mRNA0;Parent=gene0
> NC_011025.1	RefSeq	exon	107	1471	.	+	0	ID=exon0;Parent=mRNA0
> 
> 
> The question is what to do.
> 
> Not sure.
> 
> Any other help?
> 
> Good luck,
> 
> ~Malcolm
> 
> 
> -----Original Message-----
> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Thomas Girke
> Sent: Friday, June 07, 2013 12:52 PM
> To: bioconductor at r-project.org
> Cc: Brandon Gallaher
> Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
> 
> It seems to me that makeTranscriptDbFromGFF does not yet work on the 
> bacteria GFFs from NCBI (perhaps others too): 
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
> 
> ## For instance, the following 
> library(GenomicFeatures)
> download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma_arthritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff")
> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact")
>        
> ## returns this error:
> extracting transcript information
> Error in .prepareGFF3TXS(data) :
>   No Transcript information present in gff file
> 
> I guess this is because in bacteria GFF we don't have explicit
> transcript annotations. There are hacks around this problem, but it
> would be nice if this could be supported in the future right out of the
> box. I apologize if I missed an existing solution for this.
> 
> Best,
> 
> Thomas
> 
> > sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> locale:
> [1] C
> 
> attached base packages:
> [1] parallel  stats     graphics  utils     datasets  grDevices methods   base
> 
> other attached packages:
> [1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0   Biobase_2.20.0         rtracklayer_1.20.1     GenomicRanges_1.12.0   IRanges_1.18.0         BiocGenerics_0.6.0
> 
> loaded via a namespace (and not attached):
>  [1] BSgenome_1.28.0   Biostrings_2.28.0 DBI_0.2-5         RCurl_1.95-4.1    RSQLite_0.11.2    Rsamtools_1.12.0  XML_3.96-1.1      biomaRt_2.16.0    bitops_1.0-5      stats4_3.0.0      tools_3.0.0       zlibbioc_1.6.0
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list