[BioC] retrieving annotation

Nicolas Delhomme delhomme at embl.de
Sat Nov 16 23:39:13 CET 2013


Hej Kathi!

In a different thread (GTF file error when using easyRNAseq), Martin mentioned that you can access Ensembl GTF files through AnnotationHub. I have copied part of his answer below; as you can see, gene_biotype is part of the annotation:

> library(AnnotationHub)
> hub = AnnotationHub()
> hub$ensembl.release.73.<tab>
hub$ensembl.release.73.fasta. ... [378]
hub$ensembl.release.73.gtf. ... [63]
> xx = hub$ensembl.release.73.gtf.gallus_gallus.Gallus_gallus.Galgal4.73.gtf_0.0.1.RData
> xx
GRanges with 381368 ranges and 12 metadata columns:
                seqnames       ranges strand   |         source        type
                   <Rle>    <IRanges>  <Rle>   |       <factor>    <factor>
      [1]              1 [1735, 2449]      +   | protein_coding        exon
      [2]              1 [2379, 2449]      +   | protein_coding         CDS
              score     phase            gene_id      transcript_id
          <numeric> <integer>        <character>        <character>
      [1]      <NA>      <NA> ENSGALG00000009771 ENSGALT00000015891
      [2]      <NA>         0 ENSGALG00000009771 ENSGALT00000015891
          exon_number   gene_biotype            exon_id         protein_id
            <numeric>    <character>        <character>        <character>
      [1]           1 protein_coding ENSGALE00000301221               <NA>
      [2]           1 protein_coding               <NA> ENSGALP00000015874
               gene_name    transcript_name
             <character>        <character>
      [1]           <NA>               <NA>
      [2]           <NA>               <NA>
[ reached getOption("max.print") -- omitted 9 rows ]
 ---
 seqlengths:
                   1                  2 ...     AADN03010940.1
                  NA                 NA ...                 NA
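
As a minimal sketch of how one could then keep only the protein-coding features: this assumes the object carries a gene_biotype metadata column as in the output above. The toy GRanges below is only a stand-in for the real xx, and its third row (including the GENE_X identifier) is made up for illustration.

```r
library(GenomicRanges)

# Toy stand-in for the AnnotationHub object 'xx' shown above.
# The third row and its identifier are hypothetical.
xx <- GRanges(
  seqnames = c("1", "1", "1"),
  ranges   = IRanges(start = c(1735, 2379, 5000), end = c(2449, 2449, 5100)),
  strand   = c("+", "+", "-"),
  type         = c("exon", "CDS", "exon"),
  gene_id      = c("ENSGALG00000009771", "ENSGALG00000009771", "GENE_X"),
  gene_biotype = c("protein_coding", "protein_coding", "miRNA")
)

# Keep only the ranges annotated as protein coding (guarding against NA)
biotype <- mcols(xx)$gene_biotype
pc <- xx[!is.na(biotype) & biotype == "protein_coding"]

# Optionally restrict further to exons
pc_exons <- pc[mcols(pc)$type == "exon"]
```

Note that gene_biotype is a gene-level annotation, so this selects features of protein-coding genes rather than individual protein-coding isoforms.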

Hope this helps,

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On 7 Nov 2013, at 14:11, Kathi Zarnack <zarnack at ebi.ac.uk> wrote:

> Hi,
> 
> I wanted to ask whether any of the annotation packages contains information on the transcript biotype (protein-coding, etc). I would like to select only protein-coding isoforms from Ensembl annotation, but I could not find any package that includes this information (otherwise I will get it with biomaRt, I just wondered whether it is already included somewhere).
> 
> Also, I tried to download GENCODE annotation using GenomicFeatures, and got the following error:
> 
> > test=makeTranscriptDbFromUCSC(genome="hg19", tablename="wgEncodeGencodeManualV3")
> Error in tableNames(ucscTableQuery(session, track = track)) :
>  error in evaluating the argument 'object' in selecting a method for function 'tableNames': Error in normArgTrack(track, trackids) : Unknown track: Gencode Genes
> 
> I tried to get the same table for hg18, but I get only one step further:
> 
> test=makeTranscriptDbFromUCSC(genome="hg18", tablename="wgEncodeGencodeManualV3")
> Download the wgEncodeGencodeManualV3 table ... OK
> Download the wgEncodeGencodeClassesV3 table ... Error in normArgTable(value, x) :
>  unknown table name 'wgEncodeGencodeClassesV3'
> 
> Thank you very much for your help,
> Kathi
> 
> 
> ------------------------------------------
> 
> > sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> locale:
> [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
> [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
> [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
> [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets methods
> [8] base
> 
> other attached packages:
> [1] GenomicFeatures_1.14.0 AnnotationDbi_1.24.0 Biobase_2.22.0
> [4] GenomicRanges_1.14.3   XVector_0.2.0 IRanges_1.20.5
> [7] BiocGenerics_0.8.0     BiocInstaller_1.12.0
> 
> loaded via a namespace (and not attached):
> [1] biomaRt_2.18.0     Biostrings_2.30.0  bitops_1.0-6 BSgenome_1.30.0
> [5] DBI_0.2-7          RCurl_1.95-4.1     Rsamtools_1.14.1 RSQLite_0.11.4
> [9] rtracklayer_1.22.0 stats4_3.0.2       tcltk_3.0.2 tools_3.0.2
> [13] XML_3.98-1.1       zlibbioc_1.8.0
> 
> 
> -- 
> Dr. Kathi Zarnack
> Luscombe Group
> 
> European Molecular Biology Laboratory
> European Bioinformatics Institute (EMBL-EBI)
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> United Kingdom
> 
> email zarnack at ebi.ac.uk
> tel +44 1223 494 526
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


