[BioC] makeTranscriptDbFromGFF for Unstranded Transcripts

Thu May 9 04:10:38 CEST 2013

Dario,

On 05/08/2013 06:00 PM, Dario Strbenac wrote:
>> The same could probably be said of GTF and GFF files, and I wonder if
>> storing a set of unstranded mRNAs, exons, CDSs in those files is
>> considered valid.
>
>  From the specification, it is.
>
> strand - Valid entries include '+', '-', or '.' (for don't know/don't care).

Sure. But it's more complicated than that. For example, according to
the GFF3 spec here

   http://www.sequenceontology.org/gff3.shtml

the phase (column 8) is required for CDS features, and the phase can
only be interpreted correctly if the strand is known. So it's like the
strand is implicitly required for CDS features, and consequently also
for the exons in a protein-coding mRNA.

>
>> Anyway, if we wanted makeTranscriptDbFromGFF() to support such GTF and
>> GFF files, we would need to automatically replace all missing strands
>> by a + or a -.
>
> It is better if it retains the error result, so there is no ambiguity. Adding a sentence about this to the help file would be useful, since users will expect that it reads in all valid GTF and GFF files.

The requirement that the strand must be "+" or "-" in a TranscriptDb
is documented in the man page for makeTranscriptDb(), which
makeTranscriptDbFromGFF() is based on. Could certainly be made more
visible.

>
>>
>> makeTranscriptDbFromGFF("transcripts.gtf", format = "gtf", exonRankAttributeName = "exon_number")
>
>> Ok, so you've managed to store the exon rank in your file. But that
>> means that you must have *implicitly* chosen a strand for your exons
>> right?
>
> Cufflinks can infer the strand of the transcript for multi-exon transcripts by looking for the canonical GT-AG splice site in reads mapping across an intron, but not for single exon genes. So, it outputs a strand for some genes and not for others.

I see. So if I understand correctly, it produces a GTF file where
multi-exon transcripts are stranded, but not single-exon transcripts.
So we could in theory support that in makeTranscriptDbFromGFF(), and
return a TranscriptDb object where some single-exon transcripts and
their exon are not stranded. However that would be a *big* change.
Not only in makeTranscriptDbFromGFF() but in the schema of the
TranscriptDb. It would almost certainly break some of the existing
code that assumes that the features in a TranscriptDb are stranded.
And changing that code to gracefully handle that would introduce
extra complexity.

It's doable but is it worth it? Not a decision to be taken lightly.
What do other people think?

H.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319