[BioC] Recommended gene model for DESeq

Thu Apr 5 11:56:32 CEST 2012

Hi Simon,

I support fully Ensembl GTF files. As for other GFF files, I describe in the vignette of the easyRNASeq package how to generate/modify them so that I can deal with them. I'm as well performing a number of validity checks on the resulting gene models and issue warnings whenever something that might result in double-counting occurs. This implies that the generated gene model might need to be further filtered after generation. That's something I need to describe better how to do in the vignette.

Gordon, if you try easyRNASeq, let me know what's your GFF file of choice and I'll help out as that would be helpful to generate the missing documentation :-)

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On 5 Apr 2012, at 11:11, Simon Anders wrote:

> Hi Gordon
> 
> On 04/04/2012 10:30 PM, Assaf Gordon wrote:
>> I've read previous discussions about transcript vs. gene level [1]
>> and exon level considerations [2] but perhaps I've missed the bottom
>> line: Is it OK to have multiple isoforms per gene (and treat each
>> transcript as "gene record", which will result in some
>> double-counting of reads), or do I need to pre-process the gene model
>> file, to ensure there are no overlaps (e.g. by merging all isoforms
>> of a single gene) ? Or, is some post-processing needed to the DESeq
>> results (from nbinomTest()) to "normalize" genes with multiple
>> isoforms?
> 
> You should make sure that each read is counted only once per gene. Hence, counting per transcript and the summing up the counts for the transcripts from the same gene is not a good idea.
> 
> I use my htseq-count script (part of HTSeq, http://www-huber.embl.de/users/anders/HTSeq/). This Python script takes a GTF file and a SAM file and output a count table per gene. To this end, it finds all the exons with which the read overlaps and then checks whether they all have the same Gene ID. If so, it counts the read for this gene.
> 
> If you want to stay in R: Valerie Obenchain has recently added functionality to GenomicRanges to perform counting in a way similar to my htseq-count script, and described this in the vignette 'countGenomicOverlaps.pdf' (in the GenomicRanges package). Also look at Nico Delhomme's new package 'easyRNASeq' which is also meant to facilitate this task.
> 
> 
> A subtle but very annoying problem is this:
> 
> According to the GTF file standard specs, an "exon" or "CDS" line in a GTF file shlould have two attributes labelled "gene_id" and "transcript_id". Exon lines describing different transcripts of the same gene should hence have different transcript IDs but the same gene ID. Ensembl honors this rule while GTF files downloaded from the UCSC Table browser just have the transcript ID repeated as gene ID, which defeats the purpose, of course. A script won't notice that two transcripts belong to the same gene because they do not have the same gene ID.
> 
> Possible fixes: (i) Use GTF files from Ensembl. (ii) Ask the UCSC people to fix the issue. (iii) Fix a UCSC GTF file by removing from the gene_id attibute the number behind the last dot. Hence, taking e.g., this line from the UCSC GTF file for hg19:
> 
> chr1	hg19_knownGene	exon	16858	17055	0.000000	-	.	gene_id "uc009vjd.2"; transcript_id "uc009vjd.2";
> 
> change it to this
> 
> chr1	hg19_knownGene	exon	16858	17055	0.000000	-	.	gene_id "uc009vjd"; transcript_id "uc009vjd.2";
> 
> 
> This change should be easy to make on the fly by any of the three methods mentioned above. At least mine does not offer this functionality, but I should add it, I suppose. Valerie and Nico, how do your functions deal with this?
> 
> 
>  Simon
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor