[BioC] Create transcriptDb using gff3 files? - library GenomicFeatures and rtracklayer

Nicolas Delhomme delhomme at embl.de
Thu Apr 5 17:21:09 CEST 2012


Hi all,

Sorry, I haven't read the whole thread, so the few comments I have might be off the main topic.

On 5 Apr 2012, at 17:01, Cook, Malcolm wrote:

> Supporting both Ensembl's GTF and GFF3 would be ideal.
> 
> Ensembl GTF would open up many genomes, including those in:
> 	ftp://ftp.ensembl.org/pub/release-66/gtf/
> 	ftp://ftp.ensemblgenomes.org/pub/metazoa/release-13/gtf/
> 	ftp://ftp.ensemblgenomes.org/pub/fungi/release-13/gtf/
> 	ftp://ftp.ensemblgenomes.org/pub/protists/release-13/gtf/
> 	ftp://ftp.ensemblgenomes.org/pub/plants/release-13/gtf/
> 
> 
> Supporting Ensembl GTF would make it easy to distribute/archive the elements of a transcriptome analysis alongside a project/analysis in a generally useful format (i.e. IGV and other tools can work with it more or less directly)

In my package easyRNASeq, I already load Ensembl GTF files and convert them into GRanges / RangedData objects. It's pretty straightforward. I guess that adapting the code to create a TranscriptDb should be doable.
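
To make the idea concrete, here is a minimal base R sketch (not the easyRNASeq code) of turning Ensembl-style GTF exon records into two tables. The records are made up, gtf_attr is a hypothetical helper, and the column names (tx_id, tx_chrom, exon_rank, ...) are what I assume makeTranscriptDb expects for its transcripts and splicings arguments; please check ?makeTranscriptDb before relying on them.

```r
# Hypothetical Ensembl-style GTF records (9 tab-separated columns); not real annotation.
gtf_lines <- c(
  'chr1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "1";',
  'chr1\tprotein_coding\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "2";',
  'chr2\tprotein_coding\texon\t500\t650\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; exon_number "1";'
)

gtf <- read.table(text = gtf_lines, sep = "\t", quote = "", comment.char = "",
                  col.names = c("seqid", "source", "type", "start", "end",
                                "score", "strand", "phase", "attributes"),
                  colClasses = c("character", "character", "character",
                                 "integer", "integer", "character",
                                 "character", "character", "character"))

# Hypothetical helper: extract the value of one GTF attribute (key "value";).
gtf_attr <- function(attributes, key) {
  pattern <- paste0('.*', key, ' "([^"]*)".*')
  ifelse(grepl(pattern, attributes), sub(pattern, "\\1", attributes), NA_character_)
}

exons <- gtf[gtf$type == "exon", ]
exons$tx_name   <- gtf_attr(exons$attributes, "transcript_id")
exons$exon_rank <- as.integer(gtf_attr(exons$attributes, "exon_number"))

# Assign an integer id to each transcript, in order of first appearance.
tx_names    <- unique(exons$tx_name)
exons$tx_id <- match(exons$tx_name, tx_names)
first_exon  <- match(tx_names, exons$tx_name)

# One row per transcript; start/end span all of its exons.
transcripts <- data.frame(
  tx_id     = seq_along(tx_names),
  tx_name   = tx_names,
  tx_chrom  = exons$seqid[first_exon],
  tx_strand = exons$strand[first_exon],
  tx_start  = as.integer(tapply(exons$start, exons$tx_id, min)),
  tx_end    = as.integer(tapply(exons$end, exons$tx_id, max)),
  stringsAsFactors = FALSE
)

# One row per exon, linked to its transcript by tx_id.
splicings <- data.frame(
  tx_id      = exons$tx_id,
  exon_rank  = exons$exon_rank,
  exon_start = exons$start,
  exon_end   = exons$end
)
```

If the column names are right, these two data frames could then be handed to makeTranscriptDb(transcripts, splicings); CDS and gene information are left out of this sketch.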

> 
> Related note: I have learned that the BioMarts produced for Ensembl Genomes are NOT ARCHIVED, whereas it seems that historic GTF IS available.  Upshot: you'd best not depend upon being able to reproduce today's makeTranscriptDbFromBiomart result tomorrow.

I don't know where you learned that or exactly what you meant, but using biomaRt you can still access Ensembl versions as old as March 2009: see http://mar2009.archive.ensembl.org/index.html. It's not straightforward to figure out, but on the main Ensembl webpage you can get the full list by clicking the "view in archive site" link at the bottom left of the page. It redirects to this URL: http://www.ensembl.org/Help/ArchiveList.
Then, to use biomaRt on a given archive, you need to change the host argument of useMart to the URL of the corresponding Ensembl archive, as in: useMart("ENSEMBL_MART_ENSEMBL",host="mar2009.archive.ensembl.org"). I reckon that the biomaRt archive argument does not work for that. I need to post something about this on the mailing list.
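
Spelled out a bit more, as a rough sketch: archive_host and connect_archive are hypothetical helper names of mine, the dataset name in the usage comment is an assumption, and the calls need the biomaRt package plus network access to an archive that is still online.

```r
# Hypothetical helper: build the archive host name from a release tag
# as shown on the Ensembl archive list, e.g. "mar2009".
archive_host <- function(release_tag) {
  paste0(release_tag, ".archive.ensembl.org")
}

# Hypothetical helper: connect to the archived mart and, optionally,
# select a dataset on it. Nothing is contacted until this is called.
connect_archive <- function(release_tag, dataset = NULL) {
  mart <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL",
                           host = archive_host(release_tag))
  if (is.null(dataset)) mart else biomaRt::useDataset(dataset, mart = mart)
}

# Usage (not run here; the dataset name is an assumption):
# mart <- connect_archive("mar2009")
# biomaRt::listDatasets(mart)
# ens  <- connect_archive("mar2009", dataset = "hsapiens_gene_ensembl")
```

Pinning the host this way is what keeps a query tied to one Ensembl release instead of whatever the current mart happens to be.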

> 
> re: "typical gff3 files"...
> Flybase makes gff3 extracts and if my understanding is correct, have been diligent in "getting it right"

I believe so too. Again, in easyRNASeq, I do parse Flybase gff3 files and convert them to GRanges/RangedData objects, but all the merit goes to the readGff3 function from the genomeIntervals package. Reading a gff3 file with this function is extremely quick, as is accessing the gffAttributes (performed at the C layer).
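
For those unfamiliar with the gff3 attribute column, this is roughly what has to be extracted from it. The sketch below is plain base R, not the genomeIntervals implementation; the attribute strings are made up and parse_gff3_attributes / get_attr are hypothetical helpers. It ignores multi-valued (comma-separated) Parent entries.

```r
# Hypothetical gff3 attribute strings (column 9); the IDs are made up.
attrs <- c("ID=FBtr0000001;Parent=FBgn0000001;Name=example-RA",
           "ID=exon1;Parent=FBtr0000001",
           "ID=exon2;Parent=FBtr0000001")

# Split one "key=value;key=value" string into a named character vector.
parse_gff3_attributes <- function(x) {
  fields <- strsplit(x, ";", fixed = TRUE)[[1]]
  fields <- fields[nzchar(fields)]
  keys   <- sub("=.*$", "", fields)
  # gff3 values are URL-escaped, so decode them.
  values <- vapply(sub("^[^=]*=", "", fields), utils::URLdecode,
                   character(1), USE.NAMES = FALSE)
  setNames(values, keys)
}

parsed <- lapply(attrs, parse_gff3_attributes)

# Pull one key out of every record, NA where the key is absent.
get_attr <- function(key) {
  vapply(parsed,
         function(p) if (key %in% names(p)) p[[key]] else NA_character_,
         character(1))
}

ids     <- get_attr("ID")
parents <- get_attr("Parent")

# Group feature IDs by their Parent: gene -> transcripts, transcript -> exons.
children <- split(ids, parents)
```

The Parent links are what tie exons to transcripts and transcripts to genes, which is the information the transcripts and splicings tables would need.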

Cheers,

Nico

> 
> Also, NCBI historically has tried to provide GFFx extracts, with oodles of caveats.
> But last month they announced progress on improving their GFF3 offerings:  http://bio.perl.org/pipermail/bioperl-l/2012-March/036387.html
> Example: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/
> YMMV.
> 
> I too once hoped to find a makeTranscriptDbFromGFF3 capability so as to allow easy tracking of the head of Flybase's offerings, i.e. ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.44_FB2012_02/gff/ - alas, I too have not followed up.
> 
> ~Malcolm
> 
> 
>> -----Original Message-----
>> From: bioconductor-bounces at r-project.org [mailto:bioconductor-
>> bounces at r-project.org] On Behalf Of Marc Carlson
>> Sent: Wednesday, April 04, 2012 7:44 PM
>> To: bioconductor at r-project.org
>> Subject: Re: [BioC] Create transcriptDb using gff3 files? - library
>> GenomicFeatures and rtracklayer
>> 
>> I was looking at this during the course, and this is on my TODO list for
>> the next release cycle.  I think it is long overdue and I don't think
>> that the community is going to get it done in spite of all the
>> enthusiasm.  There has not been time to do it before now but I am hoping
>> that will now change.  It should be simple enough in principle, but it
>> might not be exactly trivial as I have discovered (on closer inspection)
>> that the gff specification is not as concrete as one would like it to
>> be.  Also there have been several different versions.
>> 
>> Some things that can help speed me along:
>> 
>> 1) which version is most important?  gff3?  Or one of the other
>> versions?  It is likely that with the older versions we may not be able
>> to extract as much meaningful information.
>> 
>>  2) where is the best place to find some typical gff3 files for
>> examples?  This should not be difficult, but when I was looking before I
>> was finding that people were surprisingly stingy about sharing these.
>> 
>> 
>>   Marc
>> 
>> 
>> 
>> On 04/03/2012 03:57 PM, Michael Lawrence wrote:
>>> Marc was working on this during the course in Feb. Not sure what happened
>>> to it. He said it was simple. Maybe just waiting for the release to pass.
>>> 
>>> Michael
>>> 
>>> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou <mailinglist.honeypot at gmail.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi <schoi at cornell.edu> wrote:
>>>>> Hi,
>>>>> 
>>>>> I am wondering if I could create a TranscriptDb object (library
>>>>> GenomicFeatures) using a gff3 file.  I could read a gff3 file using
>>>>> import.gff3, but I could not find a way to create a TranscriptDb object
>>>>> from the object returned by import.gff3.
>>>>>
>>>>> Two arguments for makeTranscriptDb are required: transcripts, splicings.
>>>>> It does not seem to be easy to parse this information from the object
>>>>> returned by import.gff3.  I will appreciate any help.
>>>> 
>>>> As far as I know, this functionality isn't there yet ...
>>>> 
>>>> I once (early feb, 2012) suggested I might take a crack at making this
>>>> happen but haven't actually found the time to do it ... I'm not sure
>>>> anyone in bioc-core land (hi, Marc) has found the time to do it
>>>> either, so I think you're out of luck.
>>>> 
>>>> Sorry for that. But the good news is that I bet a patch that does this
>>>> would be welcome ;-)
>>>> 
>>>> -steve
>>>> 
>>>> --
>>>> Steve Lianoglou
>>>> Graduate Student: Computational Systems Biology
>>>>  | Memorial Sloan-Kettering Cancer Center
>>>>  | Weill Medical College of Cornell University
>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> 
>> 
> 


