[BioC] Create transcriptDb using gff3 files? - library GenomicFeatures and rtracklayer

Nicolas Delhomme delhomme at embl.de
Thu Apr 5 17:51:02 CEST 2012


Hi Malcolm,

Thanks for the clarification,

Nico

On 5 Apr 2012, at 17:41, Cook, Malcolm wrote:

>> Hi all,
>> 
>> Sorry I haven't read the whole thread, still I have a few comments that might
>> be off the main topic then.
>> 
>> On 5 Apr 2012, at 17:01, Cook, Malcolm wrote:
>> 
>>> Supporting both Ensembl's GTF and GFF3 would be ideal.
>>> 
>>> Ensembl GTF would open up many genomes, including those in:
>>> 	ftp://ftp.ensembl.org/pub/release-66/gtf/
>>> 	ftp://ftp.ensemblgenomes.org/pub/metazoa/release-13/gtf/
>>> 	ftp://ftp.ensemblgenomes.org/pub/fungi/release-13/gtf/
>>> 	ftp://ftp.ensemblgenomes.org/pub/protists/release-13/gtf/
>>> 	ftp://ftp.ensemblgenomes.org/pub/plants/release-13/gtf/
>>> 
>>> 
>>> Supporting Ensembl GTF would make it easy to distribute/archive the
>>> elements of a transcriptome analysis alongside a project/analysis in a
>>> generally useful format (i.e. IGV and other tools can work with it more or
>>> less directly).
>> 
>> In my package easyRNASeq, I already load Ensembl GTF files and convert
>> them into GRanges / RangedData objects. It's pretty straightforward. I guess
>> that adapting the code to create a TranscriptDb should be doable.
>> 
>>> 
>>> On a related note, I have learned that the BioMarts produced for
>>> Ensembl Genomes are NOT ARCHIVED, whereas it seems that historic GTF IS
>>> available.  Upshot: you'd best not depend upon being able to reproduce
>>> today's TranscriptDbFromBiomart tomorrow.
>> 
>> I don't know where you learned that or exactly how you meant it, but using
>> biomaRt, you can still access Ensembl versions as old as March 2009: see
>> http://mar2009.archive.ensembl.org/index.html.
> 
> I learned it via an email exchange with Ensembl Genomes support
> 
> 	Hello Malcolm,
> 	No, I am afraid that for Ensembl Genomes we don't make older versions available through an Archive! site, like we do for Ensembl.
> 	-- 
> 	With kind regards,
> 	Bert Overduin, Ph.D.
> 	(Ensembl Helpdesk)
> 
> I realize this refers to the Ensembl Genomes web site, not the BioMart per se; however, I'm pretty sure it extends to the BioMart as well.
> 
> Note, EnsemblGenomes sites do NOT have the same archive policy as the main Ensembl site.
> 
> I would like to be able to refer to this distinction more clearly via an online policy document, or some such, and would welcome a reference if there is one to be had.
> 
>> It's not straightforward to
>> figure it out, but on the main Ensembl webpage, you can get the full list by
>> clicking the "view in archive site" link at the bottom left of the page. It
>> redirects to this URL: http://www.ensembl.org/Help/ArchiveList.
>> Then, to use biomaRt on a given archive, you need to change the host
>> argument of useMart to the URL of the corresponding Ensembl archive, as in:
>> useMart("ENSEMBL_MART_ENSEMBL", host="mar2009.archive.ensembl.org").
>> I reckon that the biomaRt archive argument does not work for that. I need
>> to post something about this on the mailing list.
> 
> 
> 
>> 
>>> 
>>> re: "typical gff3 files"...
>>> Flybase makes gff3 extracts and, if my understanding is correct, has been
>>> diligent in "getting it right".
>> 
>> I believe so too. Again, in easyRNASeq, I do parse Flybase gff3 files and
>> convert them to GRanges/RangedData objects, but all the merit goes to the
>> readGff3 function from the genomeIntervals package. Reading a gff3 file
>> with this function is extremely quick, as is accessing the gffAttributes
>> (performed at the C layer).
>> 
>> Cheers,
>> 
>> Nico
>> 
>>> 
>>> Also, NCBI historically has tried to provide GFFx extracts, with oodles of
>>> caveats.
>>> But last month they announced progress on improving their GFF3
>>> offerings:  http://bio.perl.org/pipermail/bioperl-l/2012-March/036387.html
>>> Example: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/
>>> YMMV.
>>> 
>>> I too once hoped to find makeTranscriptDbFromGFF3 capability so as to
>>> allow easy tracking of the head of Flybase's offerings, i.e.
>>> ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.44_FB2012_02/gff/
>>> - alas I too have not followed up.
>>> 
>>> ~Malcolm
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: bioconductor-bounces at r-project.org [mailto:bioconductor-
>>>> bounces at r-project.org] On Behalf Of Marc Carlson
>>>> Sent: Wednesday, April 04, 2012 7:44 PM
>>>> To: bioconductor at r-project.org
>>>> Subject: Re: [BioC] Create transcriptDb using gff3 files? - library
>>>> GenomicFeatures and rtracklayer
>>>> 
>>>> I was looking at this during the course, and this is on my TODO list for
>>>> the next release cycle.  I think it is long overdue and I don't think
>>>> that the community is going to get it done in spite of all the
>>>> enthusiasm.  There has not been time to do it before now but I am hoping
>>>> that will now change.  It should be simple enough in principle, but it
>>>> might not be exactly trivial as I have discovered (on closer inspection)
>>>> that the gff specification is not as concrete as one would like it to
>>>> be.  Also there have been several different versions.
>>>> 
>>>> Some things that can help speed me along:
>>>> 
>>>> 1) which version is most important?  gff3?  Or one of the other
>>>> versions?  It is likely that with the older versions we may not be able
>>>> to extract as much meaningful information.
>>>> 
>>>> 2) where is the best place to find some typical gff3 files for
>>>> examples?  This should not be difficult, but when I was looking before I
>>>> was finding that people were surprisingly stingy about sharing these.
>>>> 
>>>> 
>>>>  Marc
>>>> 
>>>> 
>>>> 
>>>> On 04/03/2012 03:57 PM, Michael Lawrence wrote:
>>>>> Marc was working on this during the course in Feb. Not sure what
>>>>> happened to it. He said it was simple. Maybe just waiting for the
>>>>> release to pass.
>>>>> 
>>>>> Michael
>>>>> 
>>>>> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou<
>>>>> mailinglist.honeypot at gmail.com>  wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi at cornell.edu> wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I am wondering if I could create a TranscriptDb object (library
>>>>>>> GenomicFeatures) using a gff3 file.  I could read a gff3 file using
>>>>>>> import.gff3, but I could not find a way to create a TranscriptDb object
>>>>>>> from the object returned by import.gff3.
>>>>>>> Two arguments for makeTranscriptDb are required: transcripts and
>>>>>>> splicings. It does not seem to be easy to parse this information from
>>>>>>> the object returned by import.gff3.  I would appreciate any help.
>>>>>> 
>>>>>> As far as I know, this functionality isn't there yet ...
>>>>>> 
>>>>>> I once (early Feb 2012) suggested I might take a crack at making this
>>>>>> happen but haven't actually found the time to do it ... I'm not sure
>>>>>> anyone in bioc-core land (hi, Marc) has found the time to do it
>>>>>> either, so I think you're out of luck.
>>>>>> 
>>>>>> Sorry for that. But the good news is that I bet a patch that does this
>>>>>> would be welcome ;-)
>>>>>> 
>>>>>> -steve
>>>>>> 
>>>>>> --
>>>>>> Steve Lianoglou
>>>>>> Graduate Student: Computational Systems Biology
>>>>>> | Memorial Sloan-Kettering Cancer Center
>>>>>> | Weill Medical College of Cornell University
>>>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>>> 
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>> 
> 
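
For readers landing on this thread with the original question (how to get the "transcripts" and "splicings" tables that makeTranscriptDb requires out of a gff3 file), here is a minimal base-R sketch, not anything taken from GenomicFeatures itself. Assumptions: the toy gff3 lines, the helper names parse_attrs and get_attr, and the choice of "mRNA"/"exon" as the feature types are all made up for illustration; the column names (tx_id, tx_name, tx_chrom, tx_strand, tx_start, tx_end, exon_rank, exon_start, exon_end) follow my reading of ?makeTranscriptDb and should be checked against your installed version. Exons with several parents (Parent=tx1,tx2) and CDS features are not handled.

```r
# Toy gff3 content (invented for this example): two transcripts, one per strand.
gff_lines <- c(
  "chr1\ttoy\tgene\t100\t900\t.\t+\t.\tID=g1",
  "chr1\ttoy\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=g1",
  "chr1\ttoy\texon\t100\t300\t.\t+\t.\tID=e1;Parent=tx1",
  "chr1\ttoy\texon\t500\t900\t.\t+\t.\tID=e2;Parent=tx1",
  "chr1\ttoy\tmRNA\t2000\t2600\t.\t-\t.\tID=tx2;Parent=g2",
  "chr1\ttoy\texon\t2000\t2200\t.\t-\t.\tID=e3;Parent=tx2",
  "chr1\ttoy\texon\t2400\t2600\t.\t-\t.\tID=e4;Parent=tx2"
)

# Read the nine tab-separated gff3 columns; for a real file, replace
# 'text = gff_lines' with 'file = "your.gff3"'.
gff <- read.delim(
  text = gff_lines, header = FALSE, quote = "", comment.char = "#",
  col.names = c("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"),
  colClasses = c("character", "character", "character", "integer", "integer",
                 "character", "character", "character", "character")
)

# Hypothetical helper: turn "ID=x;Parent=y" into a named character vector.
parse_attrs <- function(x) {
  kv <- strsplit(strsplit(x, ";", fixed = TRUE)[[1]], "=", fixed = TRUE)
  setNames(vapply(kv, function(p) p[2], character(1)),
           vapply(kv, function(p) p[1], character(1)))
}
attrs <- lapply(gff$attributes, parse_attrs)

# Hypothetical helper: pull one attribute for every row (NA when absent).
get_attr <- function(key) {
  vapply(attrs,
         function(a) if (key %in% names(a)) a[[key]] else NA_character_,
         character(1))
}
gff$ID     <- get_attr("ID")
gff$Parent <- get_attr("Parent")

# 'transcripts' table: one row per mRNA feature.
tx <- gff[gff$type == "mRNA", ]
transcripts <- data.frame(
  tx_id     = seq_len(nrow(tx)),
  tx_name   = tx$ID,
  tx_chrom  = tx$seqid,
  tx_strand = tx$strand,
  tx_start  = tx$start,
  tx_end    = tx$end,
  stringsAsFactors = FALSE
)

# 'splicings' table: one row per exon, linked to its transcript via Parent.
ex <- gff[gff$type == "exon", ]
ex$tx_id <- transcripts$tx_id[match(ex$Parent, transcripts$tx_name)]

# Rank exons 5' to 3': ascending start on "+", descending start on "-".
sort_key <- ifelse(ex$strand == "-", -ex$start, ex$start)
ex <- ex[order(ex$tx_id, sort_key), ]
ex$exon_rank <- ave(seq_along(ex$tx_id), ex$tx_id, FUN = seq_along)

splicings <- data.frame(
  tx_id      = ex$tx_id,
  exon_rank  = ex$exon_rank,
  exon_start = ex$start,
  exon_end   = ex$end
)

# With GenomicFeatures installed, the two tables would then be passed on:
# txdb <- GenomicFeatures::makeTranscriptDb(transcripts, splicings)
```

The only non-obvious step is the exon ranking: rank is counted along the direction of transcription, so minus-strand exons are numbered from the highest start coordinate downwards.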


