[BioC] dexseq zero length exons
garciamanteiga.josemanuel at hsr.it
Mon Mar 10 13:01:59 CET 2014
I have just analysed my results using DEXSeq 1.9.
I had prepared the gff annotation using the version of
dexseq_prepare.annotation.py in DEXSeq 1.9 that contained the -r parameter
but counted with an old version, as you advised to me, of
the dexseq_count.py since the newest included a check for a NH tag, which
my bam files (SOAPSplice) did not contain. Everything worked fine but when
checking some of the exons I realised that there are some, and some amongst
the DEU results too, that showed length equal zero. I have checked and
there are a number of those amongst the GFF. I have checked in the original
gtf file I got from Ensembl (v72 hg19) for one of those genes and I do not
understand why there should be one of length 0.
Here you are a histogram of all exons showing very short exons and 0 length
ones and an example of the gtf and gff file prepared by
Is it normal? How can there be exons with the same start and end in the gff
file? How can they have reads assigned and even pass the tests?
# exons length
exon.length<-fData(ecs)$end - fData(ecs)$start
The gff file 'grepped' for one example:
$ grep 'ENSG00000001460' Homo_sapiens.ensembl72.DEXSeq.gff
chr1 dexseq_prepare_annotation.py aggregate_gene 24683489 24743424 . -
*chr1 dexseq_prepare_annotation.py exonic_part 24683489 24683489 . - .
transcripts "ENST00000374409"; exonic_part_number "001"; gene_id
chr1 dexseq_prepare_annotation.py exonic_part 24683490 24683494 . - .
"ENST00000440416+ENST00000374409"; exonic_part_number "002"; gene_id
chr1 dexseq_prepare_annotation.py exonic_part 24683495 24683526 . - .
exonic_part_number "003"; gene_id "ENSG00000001460"
chr1 dexseq_prepare_annotation.py exonic_part 24683527 24685109 . - .
exonic_part_number "004"; gene_id "ENSG00000001460"
chr1 dexseq_prepare_annotation.py exonic_part 24687341 24687531 . - .
exonic_part_number "005"; gene_id "ENSG00000001460"
chr1 dexseq_prepare_annotation.py exonic_part 24695194 24695532 . - .
"ENST00000497384"; exonic_part_number "006"; gene_id "ENSG00000001460"
Now the gtf used as input for dexseq_prepare_annotation.py for that gene
searching for the start and end coordinate:
grep 'ENSG00000001460' hg19.ensGene_withGeneName.gtf | grep '24683489'
chr1 hg19_ensGene exon *24683489 24685109* 0.000000 - . gene_id
"ENSG00000001460"; transcript_id "*ENST00000374409*"; gene_name "STPG1";
It seems that the gtf is OK and there is an exon that starts at that
position but nothing at the end.
I checked also the ecs object coming from an older analysis of the same
dataset using the two py scripts from the same release that gave me a good
results but I wanted to repeat to include the -r parameter, and it also
contains a lot of these zero length exons.
Is it normal?
I have further checked within the Genome Browser and it seems that there is
another Transcript with start at +1 with respect to that.
If this is so it means that zero-length in the end-start difference from
the gff file that I found in the fData of the ecs object is indeed 1bp
length and just a matter of 0/1 conventions. If this is so, I would be more
relieved, but please confirm.
In anycase, may I ask how reliable counts of ~75bp read can be to an exon
made from 1 pb difference in the start of two transcripts in the
annotation? how can they even pass the filter of p-value for DEU analysis?
Shouldn't it be a sort of cut off for this 'pseudo'exons?
Thanks again for your help
Jose M. Garcia Manteiga PhD
Research Assistant - Data Analysis in Functional Genomics
Center for Translational Genomics and BioInformatics
San Raffaele Scientific Institute
Via Olgettina 58, 20132 Milano (MI), Italy
e-mail: garciamanteiga.josemanuel at hsr.it
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 5196 bytes
Desc: not available
More information about the Bioconductor