[BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes

Sat Jun 8 01:55:34 CEST 2013

Hi Thomas, Malcolm,

This GFF3 file doesn't pass validation using the online validator at:

   http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online

but for a minor problem that doesn't have anything to do with what is
being discussed here (still, it's sad that an organization like NCBI
distributes invalid GFF3 files :-/ ).

Anyway you're right Thomas that it would probably make sense to
be able to import those files with makeTranscriptDbFromGFF().
Don't know how hard that will be though. Marc being on vacation right
now, this won't be addressed until 2 weeks from now.

I don't have an easy workaround to offer in the mean time, sorry...
but maybe someone has?

Cheers,
H.

On 06/07/2013 04:33 PM, Thomas Girke wrote:
> The perl -> genbank -> gff -> R -> txdb approach seems to work in
> R-2.15.1 but not in R-3.0.0 anymore. The latter returns
>
> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact")
> extracting transcript information
> Error in .prepareGFF3TXS(data) : Unexpected transcript duplicates
>
> Given that there are over 2500 bacteria genomes annotated this way at NCBI,
> a more generic solution to this problem would be very useful, I think.
>
> Thomas
>
>
> On Fri, Jun 07, 2013 at 07:56:32PM +0000, Cook, Malcolm wrote:
>> FYI, bioperl includes bp_genbank2gff3.pl
>>
>> which when run as
>>
>>> bp_genbank2gff3.pl NC_011025.gbk
>>
>> produces NC_011025.gbk.gff (attached)
>>
>> which loaded without error with transcript:
>>
>>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact")
>> extracting transcript information
>> Extracting gene IDs
>> extracting transcript information
>> Processing splicing information for gff3 file.
>> Deducing exon rank from relative coordinates provided
>> Prepare the 'metadata' data frame ... metadata: OK
>> Now generating chrominfo from available sequence names. No chromosome length information is available.
>> Warning messages:
>> 1: In .deduceExonRankings(exs, format = "gff") :
>>    Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
>> 2: In matchCircularity(chroms, circ_seqs) :
>>    None of the strings in your circ_seqs argument match your seqnames.
>>> txdb
>> TranscriptDb object:
>> | Db type: TranscriptDb
>> | Supporting package: GenomicFeatures
>> | Data source: NCBI
>> | Genus and Species: Some bact
>> | miRBase build ID: NA
>> | transcript_nrow: 631
>> | exon_nrow: 631
>> | cds_nrow: 631
>> | Db created by: GenomicFeatures package from Bioconductor
>> | Creation time: 2013-06-07 14:52:50 -0500 (Fri, 07 Jun 2013)
>> | GenomicFeatures version at creation time: 1.10.2
>> | RSQLite version at creation time: 0.11.2
>> | DBSCHEMAVERSION: 1.0
>>
>>
>>
>>
>> -----Original Message-----
>> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Cook, Malcolm
>> Sent: Friday, June 07, 2013 2:32 PM
>> To: 'Thomas Girke'; 'bioconductor at r-project.org'
>> Cc: 'Brandon Gallaher'
>> Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
>>
>> Taking a quick look at the GFF in question ... I don't see any mRNA features....  they appear to be implicit .... which is not well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml)
>>
>> That is your first problem.
>>
>> In particular, the first gene in a file by itself is rejected.  Adding the mRNA and exon lines as below, and the first gene is now accepted by makeTranscriptDbFromGFF.
>>
>> ##gff-version 3
>> #!gff-spec-version 1.20
>> #!processor NCBI annotwriter
>> ##sequence-region NC_011025.1 1 820453
>> ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272
>> NC_011025.1	RefSeq	region	1	820453	.	+	.	ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=158L3-1
>> NC_011025.1	RefSeq	gene	107	1471	.	+	.	ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA;locus_tag=MARTH_orf001
>> NC_011025.1	RefSeq	CDS	107	1471	.	+	0	ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_001999673.1,GeneID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4
>> NC_011025.1	RefSeq	mRNA	107	1471	.	+	0	ID=mRNA0;Parent=gene0
>> NC_011025.1	RefSeq	exon	107	1471	.	+	0	ID=exon0;Parent=mRNA0
>>
>>
>> The question is what to do.
>>
>> Not sure.
>>
>> Any other help?
>>
>> Good luck,
>>
>> ~Malcolm
>>
>>
>> -----Original Message-----
>> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Thomas Girke
>> Sent: Friday, June 07, 2013 12:52 PM
>> To: bioconductor at r-project.org
>> Cc: Brandon Gallaher
>> Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
>>
>> It seems to me that makeTranscriptDbFromGFF does not yet work on the
>> bacteria GFFs from NCBI (perhaps others too):
>> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
>>
>> ## For instance, the following
>> library(GenomicFeatures)
>> download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma_arthritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff")
>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact")
>>
>> ## returns this error:
>> extracting transcript information
>> Error in .prepareGFF3TXS(data) :
>>    No Transcript information present in gff file
>>
>> I guess this is because in bacteria GFF we don't have explicit
>> transcript annotations. There are hacks around this problem, but it
>> would be nice if this could be supported in the future right out of the
>> box. I apologize if I missed an existing solution for this.
>>
>> Best,
>>
>> Thomas
>>
>>> sessionInfo()
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] parallel  stats     graphics  utils     datasets  grDevices methods   base
>>
>> other attached packages:
>> [1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0   Biobase_2.20.0         rtracklayer_1.20.1     GenomicRanges_1.12.0   IRanges_1.18.0         BiocGenerics_0.6.0
>>
>> loaded via a namespace (and not attached):
>>   [1] BSgenome_1.28.0   Biostrings_2.28.0 DBI_0.2-5         RCurl_1.95-4.1    RSQLite_0.11.2    Rsamtools_1.12.0  XML_3.96-1.1      biomaRt_2.16.0    bitops_1.0-5      stats4_3.0.0      tools_3.0.0       zlibbioc_1.6.0
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319