[BioC] dexseq zero length exons

Mon Mar 10 13:01:59 CET 2014

Dear Alejandro,
I have just analysed my results using DEXSeq 1.9.
I had prepared the gff annotation using the version of
dexseq_prepare.annotation.py in DEXSeq 1.9 that contained the -r parameter
but counted with an old version, as you advised to me, of
the dexseq_count.py since the newest included a check for a NH tag, which
my bam files (SOAPSplice) did not contain. Everything worked fine but when
checking some of the exons I realised that there are some, and some amongst
the DEU results too, that showed length equal zero. I have checked and
there are a number of those amongst the GFF. I have checked in the original
gtf file I got from Ensembl (v72 hg19) for one of those genes and I do not
understand why there should be one of length 0.
Here you are a histogram of all exons showing very short exons and 0 length
ones and an example of the gtf and gff file prepared by
dexseq_prepare_annotation.py.
Is it normal? How can there be exons with the same start and end in the gff
file? How can they have reads assigned and even pass the tests?

# exons length
exon.length<-fData(ecs)$end - fData(ecs)$start
hist(log2(exon.length),breaks=100)

The gff file 'grepped' for one example:

$ grep 'ENSG00000001460' Homo_sapiens.ensembl72.DEXSeq.gff

chr1 dexseq_prepare_annotation.py aggregate_gene 24683489 24743424 . -
. gene_id
"ENSG00000001460"

*chr1 dexseq_prepare_annotation.py exonic_part 24683489 24683489 . - .
transcripts "ENST00000374409"; exonic_part_number "001"; gene_id
"ENSG00000001460"*

chr1 dexseq_prepare_annotation.py exonic_part 24683490 24683494 . - .
transcripts
"ENST00000440416+ENST00000374409"; exonic_part_number "002"; gene_id
"ENSG00000001460"

chr1 dexseq_prepare_annotation.py exonic_part 24683495 24683526 . - .
transcripts
"ENST00000468303+ENST00000440416+ENST00000003583+ENST00000374409";
exonic_part_number "003"; gene_id "ENSG00000001460"

chr1 dexseq_prepare_annotation.py exonic_part 24683527 24685109 . - .
transcripts
"ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENST00000337248";
exonic_part_number "004"; gene_id "ENSG00000001460"

chr1 dexseq_prepare_annotation.py exonic_part 24687341 24687531 . - .
transcripts
"ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENST00000337248";
exonic_part_number "005"; gene_id "ENSG00000001460"

chr1 dexseq_prepare_annotation.py exonic_part 24695194 24695532 . - .
transcripts
"ENST00000497384"; exonic_part_number "006"; gene_id "ENSG00000001460"
Now the gtf used as input for dexseq_prepare_annotation.py for that gene
searching for the start and end coordinate:

grep 'ENSG00000001460' hg19.ensGene_withGeneName.gtf | grep '24683489'

chr1 hg19_ensGene exon *24683489 24685109* 0.000000 - . gene_id
"ENSG00000001460"; transcript_id "*ENST00000374409*"; gene_name "STPG1";

It seems that the gtf is OK and there is an exon that starts at that
position but nothing at the end.

I checked also the ecs object coming from an older analysis of the same
dataset using the two py scripts from the same release that gave me a good
results but I wanted to repeat to include the -r parameter, and it also
contains a lot of these zero length exons.

Is it normal?

I have further checked within the Genome Browser and it seems that there is
another Transcript with start at +1 with respect to that.

If this is so it means that zero-length in the end-start difference from
the gff file that I found in the fData of the ecs object is indeed 1bp
length and just a matter of 0/1 conventions. If this is so, I would be more
relieved, but please confirm.

In anycase, may I ask how reliable counts of ~75bp read can be to an exon
made from 1 pb difference in the start of two transcripts in the
annotation? how can they even pass the filter of p-value for DEU analysis?
Shouldn't it be a sort of cut off for this 'pseudo'exons?

Thanks again for your help

Best

Jose

-- 
Jose M. Garcia Manteiga PhD
Research Assistant - Data Analysis in Functional Genomics
Center for Translational Genomics and BioInformatics
Dibit2-Basilica, 3A3
San Raffaele Scientific Institute
Via Olgettina 58, 20132 Milano (MI), Italy

Tel: +39-02-2643-4751
e-mail: garciamanteiga.josemanuel at hsr.it
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Exon_length_1.9.pdf
Type: application/pdf
Size: 5196 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140310/b3021f2f/attachment.pdf>