[BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes

Marc Carlson mcarlson at fhcrc.org
Tue Jun 11 19:33:12 CEST 2013


Hi Thomas,

I will look into finding some way to accommodate these strange NCBI files.


   Marc


On 06/07/2013 06:16 PM, Thomas Girke wrote:
> No worries about the timing. I just wanted to point out that it would be
> worthwhile to address this some time in the future since it applies to a
> huge number of sequenced genomes. I totally understand the challenge Marc is
> facing supporting all these inconsistent data formats.
>
> Thomas
>
> On Fri, Jun 07, 2013 at 11:55:34PM +0000, Hervé Pagès wrote:
>> Hi Thomas, Malcolm,
>>
>> This GFF3 file doesn't pass validation using the online validator at:
>>
>>     http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online
>>
>> but for a minor problem that doesn't have anything to do with what is
>> being discussed here (still, it's sad that an organization like NCBI
>> distributes invalid GFF3 files :-/ ).
>>
>> Anyway you're right Thomas that it would probably make sense to
>> be able to import those files with makeTranscriptDbFromGFF().
>> Don't know how hard that will be though. Marc being on vacation right
>> now, this won't be addressed until 2 weeks from now.
>>
>> I don't have an easy workaround to offer in the mean time, sorry...
>> but maybe someone has?
>>
>> Cheers,
>> H.
>>
>>
>> On 06/07/2013 04:33 PM, Thomas Girke wrote:
>>> The perl -> genbank -> gff -> R -> txdb approach seems to work in
>>> R-2.15.1 but not in R-3.0.0 anymore. The latter returns
>>>
>>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact")
>>> extracting transcript information
>>> Error in .prepareGFF3TXS(data) : Unexpected transcript duplicates
>>>
>>> Given that there are over 2500 bacteria genomes annotated this way at NCBI,
>>> a more generic solution to this problem would be very useful, I think.
>>>
>>> Thomas
>>>
>>>
>>> On Fri, Jun 07, 2013 at 07:56:32PM +0000, Cook, Malcolm wrote:
>>>> FYI, bioperl includes bp_genbank2gff3.pl
>>>>
>>>> which when run as
>>>>
>>>>> bp_genbank2gff3.pl NC_011025.gbk
>>>> produces NC_011025.gbk.gff (attached)
>>>>
>>>> which loaded without error with transcript:
>>>>
>>>>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact")
>>>> extracting transcript information
>>>> Extracting gene IDs
>>>> extracting transcript information
>>>> Processing splicing information for gff3 file.
>>>> Deducing exon rank from relative coordinates provided
>>>> Prepare the 'metadata' data frame ... metadata: OK
>>>> Now generating chrominfo from available sequence names. No chromosome length information is available.
>>>> Warning messages:
>>>> 1: In .deduceExonRankings(exs, format = "gff") :
>>>>     Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
>>>> 2: In matchCircularity(chroms, circ_seqs) :
>>>>     None of the strings in your circ_seqs argument match your seqnames.
>>>>> txdb
>>>> TranscriptDb object:
>>>> | Db type: TranscriptDb
>>>> | Supporting package: GenomicFeatures
>>>> | Data source: NCBI
>>>> | Genus and Species: Some bact
>>>> | miRBase build ID: NA
>>>> | transcript_nrow: 631
>>>> | exon_nrow: 631
>>>> | cds_nrow: 631
>>>> | Db created by: GenomicFeatures package from Bioconductor
>>>> | Creation time: 2013-06-07 14:52:50 -0500 (Fri, 07 Jun 2013)
>>>> | GenomicFeatures version at creation time: 1.10.2
>>>> | RSQLite version at creation time: 0.11.2
>>>> | DBSCHEMAVERSION: 1.0
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Cook, Malcolm
>>>> Sent: Friday, June 07, 2013 2:32 PM
>>>> To: 'Thomas Girke'; 'bioconductor at r-project.org'
>>>> Cc: 'Brandon Gallaher'
>>>> Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
>>>>
>>>> Taking a quick look at the GFF in question ... I don't see any mRNA features....  they appear to be implicit .... which is not well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml)
>>>>
>>>> That is your first problem.
>>>>
>>>> In particular, the first gene in a file by itself is rejected.  Adding the mRNA and exon lines as below, and the first gene is now accepted by makeTranscriptDbFromGFF.
>>>>
>>>> ##gff-version 3
>>>> #!gff-spec-version 1.20
>>>> #!processor NCBI annotwriter
>>>> ##sequence-region NC_011025.1 1 820453
>>>> ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272
>>>> NC_011025.1	RefSeq	region	1	820453	.	+	.	ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=158L3-1
>>>> NC_011025.1	RefSeq	gene	107	1471	.	+	.	ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA;locus_tag=MARTH_orf001
>>>> NC_011025.1	RefSeq	CDS	107	1471	.	+	0	ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_001999673.1,GeneID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4
>>>> NC_011025.1	RefSeq	mRNA	107	1471	.	+	0	ID=mRNA0;Parent=gene0
>>>> NC_011025.1	RefSeq	exon	107	1471	.	+	0	ID=exon0;Parent=mRNA0
>>>>
>>>>
>>>> The question is what to do.
>>>>
>>>> Not sure.
>>>>
>>>> Any other help?
>>>>
>>>> Good luck,
>>>>
>>>> ~Malcolm
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Thomas Girke
>>>> Sent: Friday, June 07, 2013 12:52 PM
>>>> To: bioconductor at r-project.org
>>>> Cc: Brandon Gallaher
>>>> Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
>>>>
>>>> It seems to me that makeTranscriptDbFromGFF does not yet work on the
>>>> bacteria GFFs from NCBI (perhaps others too):
>>>> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
>>>>
>>>> ## For instance, the following
>>>> library(GenomicFeatures)
>>>> download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma_arthritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff")
>>>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact")
>>>>
>>>> ## returns this error:
>>>> extracting transcript information
>>>> Error in .prepareGFF3TXS(data) :
>>>>     No Transcript information present in gff file
>>>>
>>>> I guess this is because in bacteria GFF we don't have explicit
>>>> transcript annotations. There are hacks around this problem, but it
>>>> would be nice if this could be supported in the future right out of the
>>>> box. I apologize if I missed an existing solution for this.
>>>>
>>>> Best,
>>>>
>>>> Thomas
>>>>
>>>>> sessionInfo()
>>>> R version 3.0.0 (2013-04-03)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>> [1] C
>>>>
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  utils     datasets  grDevices methods   base
>>>>
>>>> other attached packages:
>>>> [1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0   Biobase_2.20.0         rtracklayer_1.20.1     GenomicRanges_1.12.0   IRanges_1.18.0         BiocGenerics_0.6.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>>    [1] BSgenome_1.28.0   Biostrings_2.28.0 DBI_0.2-5         RCurl_1.95-4.1    RSQLite_0.11.2    Rsamtools_1.12.0  XML_3.96-1.1      biomaRt_2.16.0    bitops_1.0-5      stats4_3.0.0      tools_3.0.0       zlibbioc_1.6.0
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>> -- 
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list