[BioC] gff files: how to tell if right-open interval convention used?

Hervé Pagès hpages at fhcrc.org
Fri Apr 1 01:26:28 CEST 2011


karl,

On 03/31/2011 03:42 PM, karlerhard at berkeley.edu wrote:
>
> Thanks for the answers, very helpful.  I was a bit confused reading the
> help page for the readGff3 function, which states:
>
> "When the GFF file follows the right-open interval convention (isRightOpen
> is TRUE), then GFF entries for which end base equals first base are
> recognized as zero-length features and loaded as inter_base intervals."

Also in the same man page:

   Usage:

      readGff3(file, isRightOpen=TRUE)

   Arguments:

     file: The name of the gff file to read.

   isRightOpen: Although a proper GFF3 file follows the convention of
           right-open intervals, improper GFF files following the
           right-closed convention are frequently found.  Set
           ‘isRightOpen = FALSE’ in this case.

But this looks incorrect to me.

My understanding of the GFF3 specs at http://www.sequenceontology.org
is different:

   Columns 4 & 5: "start" and "end" The start and end of the feature, in
   1-based integer coordinates, relative to the landmark given in column
   1. Start is always less than or equal to end.

   <-- omitting some stuff -->

   For zero-length features, such as insertion sites, start equals end
   and the implied site is to the right of the indicated base in the
   direction of the landmark.

Nothing is said about using a right-open interval convention.

Zero-length features are treated specially, but even they do not use a
right-open interval convention. Their representation could arguably be
read as a right-open interval if the implied site were to the *left* of
the indicated base, but the spec places it to the right.
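
To make the difference concrete, here is a minimal Python sketch (not part
of girafe or any Bioconductor package; the function names are made up for
illustration) of how a 1-based closed GFF3 interval relates to a 0-based
right-open one such as BED uses. It treats start == end as a one-base
feature. Per the spec text quoted above, a zero-length feature such as an
insertion site is also written with start == end, so the coordinates alone
cannot tell the two apart.

```python
def gff_length(start, end):
    """Length of a feature given 1-based, closed [start, end] coordinates."""
    if start > end:
        raise ValueError("GFF3 requires start <= end")
    # Both endpoints are part of the feature, hence the +1.
    return end - start + 1


def gff_to_half_open(start, end):
    """Convert 1-based closed [start, end] to 0-based right-open [start-1, end)."""
    return start - 1, end


def half_open_to_gff(start0, end0):
    """Convert 0-based right-open [start0, end0) back to 1-based closed."""
    return start0 + 1, end0


if __name__ == "__main__":
    # A feature covering bases 101..200 is 100 bases long.
    print(gff_length(101, 200))
    # The same feature in 0-based right-open coordinates.
    print(gff_to_half_open(101, 200))
```

Note that the length is end - start + 1 under the closed convention and
end - start under the right-open one, which is where off-by-one errors
come from when the wrong convention is assumed.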

>
> This suggests that there are gff files out there that have this right-open
> interval convention.  I will just have to find out the source of the
> particular gff file I'm using (which describes a filtered gene list for
> maize) to determine whether it uses this convention or not.  Though based
> on your comments, I think I should assume that it is 1-based closed.

Yes, 1-based closed on both sides.
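
If you want to sanity-check a particular file, one rough approach is to
compare a feature's coordinates against a length you know from an
independent source (for example the length of that gene's sequence). The
Python helper below is a hypothetical sketch of that check, not something
provided by girafe, and it only works when such a known length is
available.

```python
def guess_convention(start, end, known_length):
    """Guess the interval convention of one GFF entry from a known length.

    Returns "closed" if the coordinates match a 1-based closed interval,
    "right-open" if they match a right-open interval, and None if the
    known length fits neither reading.
    """
    if known_length == end - start + 1:
        return "closed"
    if known_length == end - start:
        return "right-open"
    return None


if __name__ == "__main__":
    # Hypothetical entry: a 100-base feature recorded as 101..200.
    print(guess_convention(101, 200, 100))
```

A single feature is weak evidence, so it is worth repeating the check on
several entries before drawing a conclusion about the whole file.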

Cheers,
H.

>
>
>
>> karl,
>>
>> GFF should always be 1-based closed, not 0-based right-open (unlike BED
>> format).  I think this convention goes back to the original version of GFF
>> from Sanger up to the latest version, GFF3.
>>
>> So, it probably comes down to whether the source of the GFF output you are
>> using is generating the correct coordinates, not how R/BioC is processing
>> it.  Unless the girafe method in question is allowing BED output to be
>> read as well (I would consider that bad).
>>
>> chris
>>
>> On Mar 30, 2011, at 3:58 PM, karlerhard at berkeley.edu wrote:
>>
>>>
>>> Hi all,
>>>
>>> I'm a grad student at UC Berkeley, I'm new to the list, as well as to R
>>> programs in general, so I hope you'll forgive my simplistic questions.
>>>
>>> I'm working with the girafe package to generate counts table which can be
>>> input into edgeR.  I've noticed that the readGff3 function is sensitive to
>>> whether the gff file being read uses this "right-open interval convention"
>>> or not.  I'm just not sure how to tell if the gff file I am using follows
>>> this convention.  Is there a simple way to find out?
>>>
>>> Any help on this would be greatly appreciated.
>>>
>>> best,
>>>
>>> karl
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


