[BioC] Recommended gene model for DESeq

Wed Apr 4 22:30:34 CEST 2012

Hello,

I have a question regarding the gene model source to use with DESeq.

Assuming the following workflow:
1. Map reads to the genome (bowtie/tophat/bwa/etc.).
2. Count hits per gene (HTSeq/CoverageBed/etc.).
3. Repeat steps 1 and 2 for all samples, and merge the counts into one table.
4. Run DESeq on the merged table.
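
As an illustration of step 3, a minimal Python sketch of merging per-sample counts into one table could look like the following. The function name `merge_counts`, the sample names and the counts are all hypothetical, and a gene missing from a sample is assumed to have a count of zero.

```python
def merge_counts(per_sample):
    """Merge per-sample counts into one gene-by-sample table.

    per_sample maps sample name -> {gene id -> read count}.
    Returns (sample names, {gene id -> list of counts, one per sample}).
    """
    samples = list(per_sample)  # column order follows insertion order
    # Collect every gene seen in any sample, sorted for a stable row order.
    genes = sorted(set().union(*(counts.keys() for counts in per_sample.values())))
    # Assumption: a gene absent from a sample is treated as a zero count.
    table = {gene: [per_sample[s].get(gene, 0) for s in samples] for gene in genes}
    return samples, table


# Hypothetical counts for two samples.
per_sample = {
    "sample1": {"geneA": 10, "geneB": 0},
    "sample2": {"geneA": 7, "geneC": 3},
}
samples, table = merge_counts(per_sample)

# Write the merged table as tab-separated text, one row per gene.
print("gene\t" + "\t".join(samples))
for gene, row in table.items():
    print(gene + "\t" + "\t".join(str(n) for n in row))
```

The resulting tab-separated table (one row per gene, one column per sample) is the kind of input step 4 assumes.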

My question is about step 2:
What is the recommended gene model to use when counting hits per gene?

RefSeq-Genes, UCSC Known Genes, Ensembl Genes and others come to mind,
but those usually contain multiple transcripts per gene as different records - would that skew the DESeq results?

(Note that I'm interested in gene-level differential expression and am not concerned with isoform-level differential expression.)

I've read previous discussions about transcript- vs. gene-level [1] and exon-level considerations [2], but perhaps I've missed the bottom line:
Is it OK to have multiple isoforms per gene (treating each transcript as a separate "gene record", which will result in some double-counting of reads), or do I need to pre-process the gene model file to ensure there are no overlaps (e.g. by merging all isoforms of a single gene)?
Or is some post-processing of the DESeq results (from nbinomTest()) needed to "normalize" genes with multiple isoforms?
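
To make the two options concrete, here is a small Python sketch of the "merge all isoforms of a single gene" pre-processing and of the double-counting it would avoid. It is only an illustration under stated assumptions, not a statement of what DESeq requires or of how HTSeq or CoverageBed count: the function names, the gene and transcript IDs and the coordinates are hypothetical, and intervals are assumed to be half-open (start, end) pairs on a single chromosome and strand.

```python
def merge_intervals(intervals):
    """Collapse overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def union_gene_model(transcripts):
    """Build one non-overlapping exon set per gene from all its isoforms.

    transcripts is a list of (gene_id, transcript_id, [(start, end), ...]).
    """
    by_gene = {}
    for gene_id, _transcript_id, exons in transcripts:
        by_gene.setdefault(gene_id, []).extend(exons)
    return {gene: merge_intervals(exons) for gene, exons in by_gene.items()}


def records_hit(position, exon_sets):
    """Count how many records have an exon containing the given position."""
    return sum(any(s <= position < e for s, e in exons) for exons in exon_sets)


# Hypothetical gene with two isoforms that share an exon.
transcripts = [
    ("geneA", "tx1", [(100, 200), (300, 400)]),
    ("geneA", "tx2", [(150, 250), (300, 400)]),
]
gene_model = union_gene_model(transcripts)

read_pos = 320  # a read falling in the shared exon
per_transcript_hits = records_hit(read_pos, [exons for _g, _t, exons in transcripts])
per_gene_hits = records_hit(read_pos, gene_model.values())
print(gene_model, per_transcript_hits, per_gene_hits)
```

In this toy case the read is counted once per transcript record (twice in total) when each isoform is treated as its own record, but only once against the merged gene record.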

Any suggestions and comments will be appreciated (or corrections, if something above is wrong).

Thanks,
 -gordon

[1] http://article.gmane.org/gmane.science.biology.informatics.conductor/38805/
[2] http://article.gmane.org/gmane.science.biology.informatics.conductor/38915/