[BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes

Thomas Girke thomas.girke at ucr.edu
Sat Jun 8 03:16:47 CEST 2013


No worries about the timing. I just wanted to point out that it would be
worthwhile to address this some time in the future since it applies to a
huge number of sequenced genomes. I totally understand the challenge Marc is
facing supporting all these inconsistent data formats. 

Thomas

On Fri, Jun 07, 2013 at 11:55:34PM +0000, Hervé Pagès wrote:
> Hi Thomas, Malcolm,
> 
> This GFF3 file doesn't pass validation using the online validator at:
> 
>    http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online
> 
> but for a minor problem that doesn't have anything to do with what is
> being discussed here (still, it's sad that an organization like NCBI
> distributes invalid GFF3 files :-/ ).
> 
> Anyway you're right Thomas that it would probably make sense to
> be able to import those files with makeTranscriptDbFromGFF().
> Don't know how hard that will be though. Marc being on vacation right
> now, this won't be addressed until 2 weeks from now.
> 
> I don't have an easy workaround to offer in the mean time, sorry...
> but maybe someone has?
> 
> Cheers,
> H.
> 
> 
> On 06/07/2013 04:33 PM, Thomas Girke wrote:
> > The perl -> genbank -> gff -> R -> txdb approach seems to work in
> > R-2.15.1 but not in R-3.0.0 anymore. The latter returns
> >
> > txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact")
> > extracting transcript information
> > Error in .prepareGFF3TXS(data) : Unexpected transcript duplicates
> >
> > Given that there are over 2500 bacteria genomes annotated this way at NCBI,
> > a more generic solution to this problem would be very useful, I think.
> >
> > Thomas
> >
> >
> > On Fri, Jun 07, 2013 at 07:56:32PM +0000, Cook, Malcolm wrote:
> >> FYI, bioperl includes bp_genbank2gff3.pl
> >>
> >> which when run as
> >>
> >>> bp_genbank2gff3.pl NC_011025.gbk
> >>
> >> produces NC_011025.gbk.gff (attached)
> >>
> >> which loaded without error with transcript:
> >>
> >>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact")
> >> extracting transcript information
> >> Extracting gene IDs
> >> extracting transcript information
> >> Processing splicing information for gff3 file.
> >> Deducing exon rank from relative coordinates provided
> >> Prepare the 'metadata' data frame ... metadata: OK
> >> Now generating chrominfo from available sequence names. No chromosome length information is available.
> >> Warning messages:
> >> 1: In .deduceExonRankings(exs, format = "gff") :
> >>    Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
> >> 2: In matchCircularity(chroms, circ_seqs) :
> >>    None of the strings in your circ_seqs argument match your seqnames.
> >>> txdb
> >> TranscriptDb object:
> >> | Db type: TranscriptDb
> >> | Supporting package: GenomicFeatures
> >> | Data source: NCBI
> >> | Genus and Species: Some bact
> >> | miRBase build ID: NA
> >> | transcript_nrow: 631
> >> | exon_nrow: 631
> >> | cds_nrow: 631
> >> | Db created by: GenomicFeatures package from Bioconductor
> >> | Creation time: 2013-06-07 14:52:50 -0500 (Fri, 07 Jun 2013)
> >> | GenomicFeatures version at creation time: 1.10.2
> >> | RSQLite version at creation time: 0.11.2
> >> | DBSCHEMAVERSION: 1.0
> >>
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Cook, Malcolm
> >> Sent: Friday, June 07, 2013 2:32 PM
> >> To: 'Thomas Girke'; 'bioconductor at r-project.org'
> >> Cc: 'Brandon Gallaher'
> >> Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
> >>
> >> Taking a quick look at the GFF in question ... I don't see any mRNA features....  they appear to be implicit .... which is not well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml)
> >>
> >> That is your first problem.
> >>
> >> In particular, the first gene in a file by itself is rejected.  Adding the mRNA and exon lines as below, and the first gene is now accepted by makeTranscriptDbFromGFF.
> >>
> >> ##gff-version 3
> >> #!gff-spec-version 1.20
> >> #!processor NCBI annotwriter
> >> ##sequence-region NC_011025.1 1 820453
> >> ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272
> >> NC_011025.1	RefSeq	region	1	820453	.	+	.	ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=158L3-1
> >> NC_011025.1	RefSeq	gene	107	1471	.	+	.	ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA;locus_tag=MARTH_orf001
> >> NC_011025.1	RefSeq	CDS	107	1471	.	+	0	ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_001999673.1,GeneID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4
> >> NC_011025.1	RefSeq	mRNA	107	1471	.	+	0	ID=mRNA0;Parent=gene0
> >> NC_011025.1	RefSeq	exon	107	1471	.	+	0	ID=exon0;Parent=mRNA0
> >>
> >>
> >> The question is what to do.
> >>
> >> Not sure.
> >>
> >> Any other help?
> >>
> >> Good luck,
> >>
> >> ~Malcolm
> >>
> >>
> >> -----Original Message-----
> >> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Thomas Girke
> >> Sent: Friday, June 07, 2013 12:52 PM
> >> To: bioconductor at r-project.org
> >> Cc: Brandon Gallaher
> >> Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
> >>
> >> It seems to me that makeTranscriptDbFromGFF does not yet work on the
> >> bacteria GFFs from NCBI (perhaps others too):
> >> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
> >>
> >> ## For instance, the following
> >> library(GenomicFeatures)
> >> download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma_arthritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff")
> >> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact")
> >>
> >> ## returns this error:
> >> extracting transcript information
> >> Error in .prepareGFF3TXS(data) :
> >>    No Transcript information present in gff file
> >>
> >> I guess this is because in bacteria GFF we don't have explicit
> >> transcript annotations. There are hacks around this problem, but it
> >> would be nice if this could be supported in the future right out of the
> >> box. I apologize if I missed an existing solution for this.
> >>
> >> Best,
> >>
> >> Thomas
> >>
> >>> sessionInfo()
> >> R version 3.0.0 (2013-04-03)
> >> Platform: x86_64-unknown-linux-gnu (64-bit)
> >>
> >> locale:
> >> [1] C
> >>
> >> attached base packages:
> >> [1] parallel  stats     graphics  utils     datasets  grDevices methods   base
> >>
> >> other attached packages:
> >> [1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0   Biobase_2.20.0         rtracklayer_1.20.1     GenomicRanges_1.12.0   IRanges_1.18.0         BiocGenerics_0.6.0
> >>
> >> loaded via a namespace (and not attached):
> >>   [1] BSgenome_1.28.0   Biostrings_2.28.0 DBI_0.2-5         RCurl_1.95-4.1    RSQLite_0.11.2    Rsamtools_1.12.0  XML_3.96-1.1      biomaRt_2.16.0    bitops_1.0-5      stats4_3.0.0      tools_3.0.0       zlibbioc_1.6.0
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319



More information about the Bioconductor mailing list