[BioC] htseq and dexseq

Simon Anders anders at embl.de
Thu Jun 27 11:01:51 CEST 2013


I'm not sure if I completely understand what the question is, but maybe 
the following remarks help:

If a read overlaps with several exons of the _same_ gene, 
dexseq_count.py counts the read for each of these exons. If the read 
overlaps with exons from different genes, it gets discarded.

Unfortunately, annotation files such as Ensembl's GTF file for humans, 
have a few overlapping genes: The right-most exons of one gene overlap 
or are even identical with the lef-most exons of the neighboring gene. 
According to the rules just stated, reads mapping to these exons should 
get discarded, because we would not be able anyway to say which of the 
two gene to count this for.

One might, however, also argue that if an exon is assigned to two genes 
then these are not two genes, but different sets of isoforms of one and 
the same gene. Therefore, dexseq_prepare_annotation.py will "fuse" these 
genes into an "aggregate gene" and give it a name formed by 
concatenating the fused genes' names with plus signs.

This sounded good in theory but, in practice, turned out to be not that 
useful, because such aggregate genes are hard to interpret. Furthermore, 
the Ensembl curators probably had good reasons to not fuse the two 
genes, and the better strategy might hence be to keep the genes seperate 
and accept that we cannot do inference on the one or two exons that 
overlap between he genes.

Hence, the new version of dexseq_prepare_annotation.py now accepts a new 
option, '-r', which switches off gene aggregation ('-r no').


More information about the Bioconductor mailing list