[BioC] makeTranscriptDbFromGFF Error for UCSC GTF File

Wed Jul 2 21:59:11 CEST 2014

OK,

It looks like Herve may have found another problem with the code that 
tries to guess the order of the exon ranking when it is not provided by 
these files.  I will look into that.

But for now I think that people should instead do this (which works 
right now):

library(GenomicFeatures)
txdb <- makeTranscriptDbFromUCSC(genome="hg19",tablename="refGene")

And another thing that relates to this is that I really don't think you 
want to use the GTF files being generated here for making TranscriptDb 
objects.

The reason is because Simon is correct about the GTF file coming from 
UCSC here.  It's not a good one.  It may also have the problem that 
Simon describes with the bad IDs, but it definitely also has another 
serious problem that traces back to a deficiency in the GTF file format 
for this use case (which is: representing a transcriptome).

The problem is that you can make a valid GTF file and not include any 
exon ranking information.  That means that the file can tell you about 
ranges for all the exons but not ever tell you what order they should be 
in to make a corresponding transcript.  And that's not good as it is 
still woefully incomplete for specifying a transcriptome.  Now for some 
really simple organisms that have very little splicing this might be 
OK.  But for humans it is almost certainly not OK which is why my code 
is throwing all those warnings in your output.

This absence basically means that you then have to guess that the exons 
occur in the order that they occur along a chromosome.  Now sometimes 
you can tell that it is not OK to make this kind of guess even with the 
limited data that you have - such as when exons are from other 
chromosomes - and these kinds of cases get thrown out - under protest: 
with even more warnings.  But in any case, my feeling is that you should 
probably never use the arguments to make that kind of guess when dealing 
with something that has a transcriptome as complex as humans.  So if you 
have a GTF file and it's for a complex organism then I really think you 
should insist that that file contains exon rankings.

This is what I was alluding to when I said that the GTF file format has 
a use case that is more general than that of storing a complete 
transcriptome.  Can you use the format to store a legimate 
transcriptome?  Yes of course.

But can you rely on the file format to ensure that a valid file will 
always contains a complete transcriptome?  Unfortunately: the answer is 
no.  Some GTF files will just not have the information needed to specify 
a transcriptome since not all the information that you need to specify 
that is required by the format.

Hope this clarifies things,

  Marc

On 07/02/2014 11:34 AM, Simon Anders wrote:
> Hi
>
> On 02/07/14 20:17, Michael Lawrence wrote:
>>> In contrast, using GTF or GFF files for making TranscriptDb objects is
>>> always a little risky because many of these files will not have been
>>> created with the intention of holding a transcriptome as data (which 
>>> is the
>>> specific thing that a TranscriptDb object is meant to hold). This is
>>> because the GTF and GFF file formats were not initially intended for 
>>> the
>>> specific purpose of holding a transcriptome but were instead 
>>> intended to be
>>> something more general.
>>>
>>>
>> Actually GTF (Gene Transfer Format) files are designed specifically for
>> representing gene models, and we have no excuse for not parsing them
>> correctly. There have been some tweaks to attribute parsing (I thought
>> limited to GFF3) in devel, so there may be a difference between Herve's
>> devel result and Dario's release result.  I'll try to find some time to
>> look into this.
>
> The problem with GTF files produced by the UCSC Table Browser is that 
> they contain incorrect gene IDs: The gene_id attribute is always set 
> to the same value as the transcript_id, and these files hence cannot 
> be used to define gene models without manual correction (which would 
> be to remove the transcript number suffix from the gene IDs).
>
> Long ago, I have asked the UCSC Genome Browser help-desk why this is 
> and was told that it is a bug in the Table Browser which they cannot 
> fix, for some reason.
>
> Hence, I usually advise to not use these files.
>
>   Simon
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor