[BioC] htseq and dexseq
anders at embl.de
Thu Jun 27 11:01:51 CEST 2013
I'm not sure if I completely understand what the question is, but maybe
the following remarks help:
If a read overlaps with several exons of the _same_ gene,
dexseq_count.py counts the read for each of these exons. If the read
overlaps with exons from different genes, it gets discarded.
Unfortunately, annotation files such as Ensembl's GTF file for humans,
have a few overlapping genes: The right-most exons of one gene overlap
or are even identical with the lef-most exons of the neighboring gene.
According to the rules just stated, reads mapping to these exons should
get discarded, because we would not be able anyway to say which of the
two gene to count this for.
One might, however, also argue that if an exon is assigned to two genes
then these are not two genes, but different sets of isoforms of one and
the same gene. Therefore, dexseq_prepare_annotation.py will "fuse" these
genes into an "aggregate gene" and give it a name formed by
concatenating the fused genes' names with plus signs.
This sounded good in theory but, in practice, turned out to be not that
useful, because such aggregate genes are hard to interpret. Furthermore,
the Ensembl curators probably had good reasons to not fuse the two
genes, and the better strategy might hence be to keep the genes seperate
and accept that we cannot do inference on the one or two exons that
overlap between he genes.
Hence, the new version of dexseq_prepare_annotation.py now accepts a new
option, '-r', which switches off gene aggregation ('-r no').
More information about the Bioconductor