[BioC] gff files: how to tell if right-open interval convention used?

Simon Anders anders at embl.de
Fri Apr 1 10:53:13 CEST 2011


Hi

I guess this all depends on which GFF specification you consider 
authoritative. Different institutions have published slightly 
contradictory specs.

According to the Sanger web site, the original GFF spec is due to Durbin 
and Haussler, and the current version (according to Sanger) is GFF2, 
which, at
   http://www.sanger.ac.uk/resources/software/gff/spec.html
specifies:

"<start>, <end>: Integers. <start> must be less than or equal to <end>. 
Sequence numbering starts at 1, so these numbers should be between 1 and 
the length of the relevant sequence, inclusive. "

A feature is typically a stretch of consecutive base pairs which make up 
something, say, an exon, so I guess they did not have zero-length 
features in mind when writing this. So, if start is less than or equal 
to end, a single-base-pair feature would be denoted with start=end. 
Furthermore, it should be possible to indicate the total chromosome as a 
feature, and then, we can fulfill the second sentence of the quote above 
only by using closed intervals.
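
To make the argument concrete, here is a minimal sketch in Python. The 
function names and the chromosome length are made up for illustration; 
this is just the arithmetic that follows from reading the quote as a 
1-based closed interval, not code from any GFF library.

```python
def closed_length(start, end):
    """Number of bases covered by a 1-based closed interval [start, end]."""
    if start > end:
        raise ValueError("GFF2 requires start <= end")
    return end - start + 1


def closed_to_half_open(start, end):
    """Convert 1-based closed [start, end] to 0-based half-open [start-1, end)."""
    return start - 1, end


chrom_length = 1000  # hypothetical sequence length

# The whole chromosome as a feature: both coordinates stay within 1..length.
print(closed_length(1, chrom_length))        # covers every base
# A single-base-pair feature has start equal to end.
print(closed_length(42, 42))
# The same single base in a 0-based half-open coordinate system.
print(closed_to_half_open(42, 42))
```

With a right-open end, the whole-chromosome feature would need 
end = length + 1, which falls outside the range the spec allows.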

The UCSC Genome Browser team also interpret it that way. At
    http://genome.ucsc.edu/FAQ/FAQformat.html#format3
they write

"[column] 5. end - The ending position of the feature (inclusive)."

and I guess, "inclusive" means closed.

Only later, somebody came up with the idea of zero-length (a.k.a. 
inter-base) features such as positions of deletions, and suggested GFF3.

Unfortunately, zero-length features can only be represented by a 
half-open convention, as otherwise, we cannot distinguish whether a 
feature with start equal to end means a single base pair or an 
inter-base position. Hence, to me, this GFF3 proposal at 
http://www.sequenceontology.org/resources/gff3.html, which Hervé quoted, 
seems to be ill-defined. It implies that intervals are half-open, without 
saying so explicitly, and so breaks backwards compatibility.
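
The ambiguity is easy to see in a small Python sketch. The two helper 
functions and the coordinates are hypothetical, chosen only to 
illustrate the point: the same record with start equal to end describes 
one base under the closed reading and zero bases under the half-open 
reading.

```python
def length_closed(start, end):
    """Feature length if both start and end are included (closed reading)."""
    return end - start + 1


def length_half_open(start, end):
    """Feature length if end is excluded (half-open reading)."""
    return end - start


# A record whose start equals its end (coordinates made up for illustration).
start, end = 100, 100

print(length_closed(start, end))     # 1: a single base pair
print(length_half_open(start, end))  # 0: an inter-base position
```

Under the closed reading with start <= end, the length is always at 
least 1, so a zero-length feature has no representation at all unless 
the spec states which convention applies.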


<rant> Every half-way competent computer scientist knows that specifying 
_clearly_ whether the end is included is the very first thing one does 
when drafting a spec for anything involving intervals, because there are 
ample examples of specs using either choice. It baffles me that most 
genomics file format specs are so unclear in such things. Is our field 
really that unprofessional? </rant>

   Simon


