[BioC] TranscriptDb of GENCODE Genes

Fri Aug 9 09:00:24 CEST 2013

Hello,

Who else uses the GENCODE annotation in their analyses? I just found out that some transcripts are annotated as incomplete fragments. This is described at http://www.gencodegenes.org/gencode_tags.html but not in "GENCODE: the reference human genome annotation for The ENCODE Project." Genome Research, 2012. The relevant tags are:

cds_end_NF : the coding region end could not be confirmed.
cds_start_NF : the coding region start could not be confirmed. 
mRNA_end_NF : the mRNA end could not be confirmed. 
mRNA_start_NF : the mRNA start could not be confirmed.

Over 10% of transcripts are tagged as missing an mRNA end, and almost as many are missing either a 5' UTR or a 3' UTR:

/nb/dario/genes$ grep -Ec "(HAVANA|ENSEMBL)[[:space:]]transcript" gencode.v17.annotation.gtf
194871
/nb/dario/genes$ grep -E "(HAVANA|ENSEMBL)[[:space:]]transcript" gencode.v17.annotation.gtf | grep -c mRNA_end_NF -
21699
/nb/dario/genes$ grep -E "(HAVANA|ENSEMBL)[[:space:]]transcript" gencode.v17.annotation.gtf | grep -c cds_end_NF -
19788
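
All four tags can also be tallied in a single pass with awk. This is only a minimal sketch, assuming the usual tab-delimited GTF layout (feature type in column 3, attributes in column 9) and tags written as tag "..."; in the attribute column. The function names count_nf_tags and sample_gtf are hypothetical, and the sample lines are made up, not real GENCODE records.

```shell
#!/bin/sh
# Hypothetical helper: tally the four *_NF tags on transcript lines of a GTF
# read from the named file(s) or, if none are given, from standard input.
count_nf_tags() {
    awk -F '\t' '
        BEGIN { ntags = split("cds_end_NF cds_start_NF mRNA_end_NF mRNA_start_NF", tags, " ") }
        $3 == "transcript" {
            total++
            for (i = 1; i <= ntags; i++)
                if (index($9, "tag \"" tags[i] "\"")) n[tags[i]]++
        }
        END {
            print "transcripts", total + 0
            for (i = 1; i <= ntags; i++) print tags[i], n[tags[i]] + 0
        }
    ' "$@"
}

# Made-up sample lines (assumption: not real GENCODE records).
sample_gtf() {
    printf 'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n'
    printf 'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; tag "basic";\n'
    printf 'chr1\tHAVANA\texon\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; tag "basic";\n'
    printf 'chr1\tHAVANA\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; tag "mRNA_end_NF"; tag "cds_end_NF";\n'
    printf 'chr1\tENSEMBL\ttranscript\t200\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T3"; tag "mRNA_start_NF";\n'
}

# Print one "name count" line for the transcript total and for each tag.
sample_gtf | count_nf_tags
```

On the real file the call would be count_nf_tags gencode.v17.annotation.gtf. A transcript carrying several tags is counted once under each of them, so the per-tag counts can add up to more than the number of incomplete transcripts.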

Have you been using this gene annotation as-is for counting in windows around transcription start sites or transcription end sites? Have you been using the functions fiveUTRsByTranscript or threeUTRsByTranscript? If so, your results are incorrect too, because the annotated ends of these incomplete transcripts are not their real ends.

Also, could the function makeTranscriptDbFromGFF be given a way to filter on elements of the attribute column? As it stands, it cannot be used to read the GENCODE annotation into R without also importing the incomplete transcripts.
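
Until such an option exists, one possible workaround is to strip the incomplete transcripts from the GTF before import. The following is a minimal sketch, not a tested pipeline. It assumes a tab-delimited GTF with attributes in column 9 and a transcript_id "..." attribute on every transcript-level line. The function name drop_incomplete and the sample records are made up for illustration. The file is read twice: the first pass collects the IDs of transcripts carrying any of the four tags, and the second pass drops every feature belonging to those IDs.

```shell
#!/bin/sh
# Hypothetical helper: print a GTF without the transcripts tagged
# cds_start_NF, cds_end_NF, mRNA_start_NF or mRNA_end_NF (and their features).
drop_incomplete() {
    awk -F '\t' '
        # Pass 1: remember the IDs of transcripts carrying any *_NF tag.
        NR == FNR {
            if ($3 == "transcript" && $9 ~ /tag "(cds|mRNA)_(start|end)_NF"/ &&
                match($9, /transcript_id "[^"]+"/))
                drop[substr($9, RSTART, RLENGTH)] = 1
            next
        }
        # Pass 2: keep header lines and every feature not tied to a dropped ID.
        /^#/ { print; next }
        {
            if (match($9, /transcript_id "[^"]+"/) &&
                (substr($9, RSTART, RLENGTH) in drop))
                next
            print
        }
    ' "$1" "$1"
}

# Made-up sample file (assumption: not real GENCODE records).
sample=$(mktemp)
{
    printf '##description: made-up example\n'
    printf 'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n'
    printf 'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; tag "basic";\n'
    printf 'chr1\tHAVANA\texon\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\tHAVANA\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; tag "mRNA_end_NF";\n'
    printf 'chr1\tHAVANA\texon\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n'
    printf 'chr1\tENSEMBL\ttranscript\t200\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T3"; tag "cds_start_NF";\n'
} > "$sample"

kept=$(drop_incomplete "$sample")
printf '%s\n' "$kept"
rm -f "$sample"
```

For the real annotation, the output of drop_incomplete gencode.v17.annotation.gtf would be redirected to a new file and that file passed to makeTranscriptDbFromGFF. Gene lines are kept even when all of a gene's transcripts are removed, and transcripts that lack an expected tag are not caught by this filter.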

The incompleteness can also be seen in R: some transcripts have a 3' UTR but no 5' UTR, and vice versa:

library(GenomicFeatures)
# Build the transcript database from the GENCODE GTF.
genes <- makeTranscriptDbFromGFF("gencode.v17.annotation.gtf", format = "gtf", exonRankAttributeName = "exon_number")
UTR5 <- fiveUTRsByTranscript(genes, use.names = TRUE)
UTR3 <- threeUTRsByTranscript(genes, use.names = TRUE)
# Transcripts with a 5' UTR but no 3' UTR, and the reverse.
whichNo3prime <- setdiff(names(UTR5), names(UTR3))
whichNo5prime <- setdiff(names(UTR3), names(UTR5))

> length(whichNo5prime)
[1] 12217
> length(whichNo3prime)
[1] 16675

So, 12217 transcripts have a 3' UTR but no 5' UTR, and 16675 transcripts have a 5' UTR but no 3' UTR.

Also, note that some transcripts lack a tag they should have. Have a look at ENST00000381469.2 in a genome browser: its coding region appears to begin at the very start of the transcript, yet it is not tagged mRNA_start_NF. Or is it possible for translation to start from the very first 3 bases of a transcript?

--------------------------------------
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia