[BioC] DEXSeq: problem with dexseq_prepare_annotation.py

Wolfgang Huber whuber at embl.de
Wed Apr 18 18:31:31 CEST 2012


Dear Stephen

as you might have guessed, the reason that the script raises these 
errors is to make you (i.e. users) look at such dodgy GTF records and to 
verify that these are indeed easily fixable formatting mistakes - 
instead of just proceeding with some sort of automatic resolution and 
potentially running into bigger, more obscure errors downstream.

I suppose if we have enough confidence on what the typical sources of 
dodginess are, and how their resolution can be automated, that could be 
added to the script. Contributions are welcome.

	Best wishes
	Wolfgang

Apr/18/12 9:30 AM, Alejandro Reyes scripsit::
> Dear Stephen,
>
> The problem is that the exon is finishing (148078883) before it starts
> (148079388).
> Usually the gtf files contain the "left most" position as start position
> independently of the strand, also the UCSC annotations but with a few
> exceptions as the one you mentioned. The reason of the script complain
> is that these gtf parsers in HTSeq were written and tested for ENSEMBL
> annotation files, where these kind of "errors" are basically absent.
> Assuming that these are simple mistakes in the gtf files, it is easy to
> solve them just by flipping the positions of the start and ends when
> start > end, or maybe by deleting this genes from the gtf files. Anyway,
> these cases are really rare, like ~15 in the human UCSC genome.
>
> Alejandro
>
>
>
>> Alejandro, Simon, Wolfgang, et al.:
>>
>> I'm trying to use the dexseq_prepare_annotation.py script to parse the
>> UCSC hg18 genes.gtf GTF file included with the Illumina igenomes
>> packages (http://tophat.cbcb.umd.edu/igenomes.html). I'm getting an
>> error:
>>
>> Traceback (most recent call last):
>> File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 93, in<module>
>> raise ValueError, "Same name found on two chromosomes: %s, %s" % (
>> str(l[i]), str(l[i+1]) )
>> ValueError: Same name found on two chromosomes:<GenomicFeature:
>> exonic_part 'CFB' at chr6_qbl_hap2: 3167392 -> 3167602 (strand '+')>,
>> <GenomicFeature: exonic_part 'CFB' at chr6_cox_hap1: 3359983 ->
>> 3360325 (strand '+')>
>>
>> I'm guessing this is because the same gene name is found in two
>> separate places. I'm not entirely sure what these two chromosomal
>> segments refer to, but I removed them from the GTF file and the python
>> script threw another error:
>>
>> Traceback (most recent call last):
>> File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 91, in<module>
>> assert l[i].iv.end<= l[i+1].iv.start, str(l[i+1]) + " starts too early"
>> AssertionError:<GenomicFeature: exonic_part 'HIST2H3C+HIST2H3A' at
>> chr1: 148079388 -> 148078883 (strand '-')> starts too early
>>
>> I'm really unsure what to make of this or how to fix it. The script
>> works without issues with the Ensembl GTF. Any help would be greatly
>> appreciated.
>>
>> Stephen
>>
>> -----------------------------------------
>> Stephen D. Turner, Ph.D.
>> Bioinformatics Core Director
>> University of Virginia School of Medicine
>> bioinformatics.virginia.edu
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Best wishes
	Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber



More information about the Bioconductor mailing list