[BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes

Cook, Malcolm MEC at stowers.org
Fri Jun 7 21:56:32 CEST 2013


FYI, bioperl includes bp_genbank2gff3.pl

which when run as

> bp_genbank2gff3.pl NC_011025.gbk

produces NC_011025.gbk.gff (attached)

which loaded without error with transcript:

> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact")
extracting transcript information
Extracting gene IDs
extracting transcript information
Processing splicing information for gff3 file.
Deducing exon rank from relative coordinates provided
Prepare the 'metadata' data frame ... metadata: OK
Now generating chrominfo from available sequence names. No chromosome length information is available.
Warning messages:
1: In .deduceExonRankings(exs, format = "gff") :
  Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
2: In matchCircularity(chroms, circ_seqs) :
  None of the strings in your circ_seqs argument match your seqnames.
> txdb
TranscriptDb object:
| Db type: TranscriptDb
| Supporting package: GenomicFeatures
| Data source: NCBI
| Genus and Species: Some bact
| miRBase build ID: NA
| transcript_nrow: 631
| exon_nrow: 631
| cds_nrow: 631
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2013-06-07 14:52:50 -0500 (Fri, 07 Jun 2013)
| GenomicFeatures version at creation time: 1.10.2
| RSQLite version at creation time: 0.11.2
| DBSCHEMAVERSION: 1.0




-----Original Message-----
From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Cook, Malcolm
Sent: Friday, June 07, 2013 2:32 PM
To: 'Thomas Girke'; 'bioconductor at r-project.org'
Cc: 'Brandon Gallaher'
Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes

Taking a quick look at the GFF in question ... I don't see any mRNA features....  they appear to be implicit .... which is not well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml)

That is your first problem.

In particular, the first gene in a file by itself is rejected.  Adding the mRNA and exon lines as below, and the first gene is now accepted by makeTranscriptDbFromGFF.

##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_011025.1 1 820453
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272
NC_011025.1	RefSeq	region	1	820453	.	+	.	ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=158L3-1
NC_011025.1	RefSeq	gene	107	1471	.	+	.	ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA;locus_tag=MARTH_orf001
NC_011025.1	RefSeq	CDS	107	1471	.	+	0	ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_001999673.1,GeneID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4
NC_011025.1	RefSeq	mRNA	107	1471	.	+	0	ID=mRNA0;Parent=gene0
NC_011025.1	RefSeq	exon	107	1471	.	+	0	ID=exon0;Parent=mRNA0


The question is what to do.

Not sure.

Any other help?

Good luck,

~Malcolm


-----Original Message-----
From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Thomas Girke
Sent: Friday, June 07, 2013 12:52 PM
To: bioconductor at r-project.org
Cc: Brandon Gallaher
Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes

It seems to me that makeTranscriptDbFromGFF does not yet work on the 
bacteria GFFs from NCBI (perhaps others too): 
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/

## For instance, the following 
library(GenomicFeatures)
download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma_arthritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff")
txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact")
       
## returns this error:
extracting transcript information
Error in .prepareGFF3TXS(data) :
  No Transcript information present in gff file

I guess this is because in bacteria GFF we don't have explicit
transcript annotations. There are hacks around this problem, but it
would be nice if this could be supported in the future right out of the
box. I apologize if I missed an existing solution for this.

Best,

Thomas

> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  utils     datasets  grDevices methods   base

other attached packages:
[1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0   Biobase_2.20.0         rtracklayer_1.20.1     GenomicRanges_1.12.0   IRanges_1.18.0         BiocGenerics_0.6.0

loaded via a namespace (and not attached):
 [1] BSgenome_1.28.0   Biostrings_2.28.0 DBI_0.2-5         RCurl_1.95-4.1    RSQLite_0.11.2    Rsamtools_1.12.0  XML_3.96-1.1      biomaRt_2.16.0    bitops_1.0-5      stats4_3.0.0      tools_3.0.0       zlibbioc_1.6.0

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


More information about the Bioconductor mailing list