[BioC] retrieving annotation

Tue Nov 19 16:10:34 CET 2013

Hi Nico,

thanks for the hint, I will have a look at AnnotationHub. I was looking 
for the transcript_biotype rather than the gene_biotype (to distinguish 
protein_coding isoforms from others such as processed_transcript), 
but this should also be included in the Ensembl GTF file.
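
A minimal sketch of how such a filter could look in base R. It assumes that the second (source) column of the GTF carries the transcript biotype, which is what the `source` column in the output quoted below suggests for this release; this should be checked against the actual file. The three GTF lines, the IDs `G1`, `T1`, `T2` and the helper `get_attr` are made up for illustration.

```r
# Hypothetical GTF lines for illustration only (not taken from a real file).
gtf_lines <- c(
  '1\tprotein_coding\texon\t1735\t2449\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding";',
  '1\tprocessed_transcript\texon\t3000\t3500\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; gene_biotype "protein_coding";',
  '1\tprotein_coding\tCDS\t2379\t2449\t.\t+\t0\tgene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding";'
)

# Read the nine tab-separated GTF columns; quote = "" keeps the double
# quotes inside the attribute column untouched.
gtf <- read.delim(text = gtf_lines, header = FALSE, quote = "",
                  stringsAsFactors = FALSE,
                  col.names = c("seqname", "source", "type", "start", "end",
                                "score", "strand", "phase", "attributes"))

# Hypothetical helper: pull the value of one key out of the attribute string,
# returning NA where the key is absent.
get_attr <- function(attributes, key) {
  pattern <- paste0('^(.*; *)?', key, ' "([^"]*)".*$')
  ifelse(grepl(pattern, attributes),
         sub(pattern, "\\2", attributes),
         NA_character_)
}

# Assumption: source == transcript biotype. Keep protein_coding records only.
coding <- gtf[gtf$source == "protein_coding", ]
coding_ids <- unique(get_attr(coding$attributes, "transcript_id"))
print(coding_ids)
```

Under the same assumption, the analogous subset of the GRanges object `xx` from the quoted session would be `xx[xx$source == "protein_coding"]`.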

Thanks,
Kathi

On 16/11/13 22:39, Nicolas Delhomme wrote:
> Hej Kathi!
>
> In a different thread (GTF file error when using easyRNAseq), Martin mentioned that you can access Ensembl GTF files through AnnotationHub. I have copied part of that answer below; as you can see, the gene_biotype is part of the annotation:
>
>> library(AnnotationHub)
>> hub = AnnotationHub()
>> hub$ensembl.release.73.<tab>
> hub$ensembl.release.73.fasta. ... [378]
> hub$ensembl.release.73.gtf. ... [63]
>> xx = hub$ensembl.release.73.gtf.gallus_gallus.Gallus_gallus.Galgal4.73.gtf_0.0.1.RData
>> xx
> GRanges with 381368 ranges and 12 metadata columns:
>                  seqnames       ranges strand   |         source        type
>                     <Rle>    <IRanges>  <Rle>   |       <factor>    <factor>
>        [1]              1 [1735, 2449]      +   | protein_coding        exon
>        [2]              1 [2379, 2449]      +   | protein_coding         CDS
>                score     phase            gene_id      transcript_id
>            <numeric> <integer>        <character>        <character>
>        [1]      <NA>      <NA> ENSGALG00000009771 ENSGALT00000015891
>        [2]      <NA>         0 ENSGALG00000009771 ENSGALT00000015891
>            exon_number   gene_biotype            exon_id         protein_id
>              <numeric>    <character>        <character>        <character>
>        [1]           1 protein_coding ENSGALE00000301221               <NA>
>        [2]           1 protein_coding               <NA> ENSGALP00000015874
>                 gene_name    transcript_name
>               <character>        <character>
>        [1]           <NA>               <NA>
>        [2]           <NA>               <NA>
> [ reached getOption("max.print") -- omitted 9 rows ]
>   ---
>   seqlengths:
>                     1                  2 ...     AADN03010940.1
>                    NA                 NA ...                 NA
>
> Hope this helps,
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
> On 7 Nov 2013, at 14:11, Kathi Zarnack <zarnack at ebi.ac.uk> wrote:
>
>> Hi,
>>
>> I wanted to ask whether any of the annotation packages contains information on the transcript biotype (protein-coding, etc). I would like to select only protein-coding isoforms from Ensembl annotation, but I could not find any package that includes this information (otherwise I will get it with biomaRt, I just wondered whether it is already included somewhere).
>>
>> Also, I tried to download GENCODE annotation using GenomicFeatures, and got the following error:
>>
>>> test=makeTranscriptDbFromUCSC(genome="hg19", tablename="wgEncodeGencodeManualV3")
>> Error in tableNames(ucscTableQuery(session, track = track)) :
>>   error in evaluating the argument 'object' in selecting a method for function 'tableNames': Error in normArgTrack(track, trackids) : Unknown track: Gencode Genes
>>
>> I tried to get the same table for hg18, but I get only one step further:
>>
>>> test=makeTranscriptDbFromUCSC(genome="hg18", tablename="wgEncodeGencodeManualV3")
>> Download the wgEncodeGencodeManualV3 table ... OK
>> Download the wgEncodeGencodeClassesV3 table ... Error in normArgTable(value, x) :
>>   unknown table name 'wgEncodeGencodeClassesV3'
>>
>> Thank you very much for your help,
>> Kathi
>>
>>
>> ------------------------------------------
>>
>>> sessionInfo()
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>> [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>> [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>> [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] GenomicFeatures_1.14.0 AnnotationDbi_1.24.0 Biobase_2.22.0
>> [4] GenomicRanges_1.14.3   XVector_0.2.0 IRanges_1.20.5
>> [7] BiocGenerics_0.8.0     BiocInstaller_1.12.0
>>
>> loaded via a namespace (and not attached):
>> [1] biomaRt_2.18.0     Biostrings_2.30.0  bitops_1.0-6 BSgenome_1.30.0
>> [5] DBI_0.2-7          RCurl_1.95-4.1     Rsamtools_1.14.1 RSQLite_0.11.4
>> [9] rtracklayer_1.22.0 stats4_3.0.2       tcltk_3.0.2 tools_3.0.2
>> [13] XML_3.98-1.1       zlibbioc_1.8.0
>>
>>
>> -- 
>> Dr. Kathi Zarnack
>> Luscombe Group
>>
>> European Molecular Biology Laboratory
>> European Bioinformatics Institute (EMBL-EBI)
>> Wellcome Trust Genome Campus
>> Hinxton
>> Cambridge CB10 1SD
>> United Kingdom
>>
>> email zarnack at ebi.ac.uk
>> tel +44 1223 494 526
>>

-- 
Dr. Kathi Zarnack
Luscombe Group

European Molecular Biology Laboratory
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

email zarnack at ebi.ac.uk
tel +44 1223 494 526