[BioC] GenomicFeatures Transcripts Retrieval Fails

James W. MacDonald jmacdon at uw.edu
Tue Jun 17 19:50:59 CEST 2014


Hi Sharvari,

On 6/17/2014 1:04 PM, sharvari gujja wrote:
> Hi Jim,
>
> Thanks for the reply. Yes, I am running this on Windows. I followed your
> suggestion to use setInternet2() function first, but I still get an error:
>
>  > setInternet2()
>  > txdb <- makeTranscriptDbFromUCSC(genome='hg19',tablename='ensGene')
> Error in function (type, msg, asError = TRUE)  : couldn't connect to host

You might consult one of your IT people about that, but otherwise I have 
nothing more for you.

Well, except you could use the makeTranscriptDbFromGFF() function if you 
are really a fan of ensGene. You can go to UCSC using a browser, then 
click on the 'Tables' menu item. There you could choose

group: Genes and Gene Predictions
track: Ensembl Genes
table: ensGene
output format: GTF - gene transfer format

Set the output file to be something reasonable, then click 'get output'. 
You can then use that to create a TxDb.
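
For example, a minimal sketch, assuming you saved the Table Browser output 
as hg19_ensGene.gtf (a made-up file name) in your working directory. In 
newer GenomicFeatures releases this function was renamed makeTxDbFromGFF(), 
so adjust to whatever your installed version provides:

```r
library(GenomicFeatures)

## Hypothetical file name; use whatever you entered as the output file
## in the UCSC Table Browser.
gtf_file <- "hg19_ensGene.gtf"

## Build the TxDb from the local GTF file, so R itself never has to
## connect to UCSC.
txdb <- makeTranscriptDbFromGFF(gtf_file, format = "gtf")
txdb
```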

But unless you are dead set on using Ensembl genes, it's probably not 
worth the bother.

>
> I also tried:
>
>  > biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")
> BioC_mirror: http://bioconductor.org
> Using Bioconductor version 2.14 (BiocInstaller 1.14.2), R version 3.1.0.
> Installing package(s) 'TxDb.Hsapiens.UCSC.hg19.knownGene'
> trying URL
> 'http://bioconductor.org/packages/2.14/data/annotation/bin/windows/contrib/3.1/TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0.zip'
> Content type 'application/zip' length 18546564 bytes (17.7 Mb)
> opened URL
> downloaded 17.7 Mb
>
> How do I read this table "TxDb.Hsapiens.UCSC.hg19.knownGene"? Also, is
> there documentation on the differences between "knownGene" and "ensGene"?

You want to do some reading:

http://bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf

There are literally a bazillion things you can do with a TxDb object, so 
unless you have a use case that you want to talk about, you will have to 
do some self-learning (which you should be doing anyway, so there you go).
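
To get you started, here is a minimal sketch, assuming the knownGene 
package you installed above loads without trouble. The vignette covers 
these accessors (and many more) in detail:

```r
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

## The package exports a TxDb object named after the package itself;
## a shorter alias is easier to type.
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

## All transcripts as a GRanges object
tx <- transcripts(txdb)

## Exons grouped by gene, as a GRangesList named by Entrez Gene ID
ex_by_gene <- exonsBy(txdb, by = "gene")

## Transcripts grouped by gene
tx_by_gene <- transcriptsBy(txdb, by = "gene")
```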

As far as documentation, you can start with UCSC's table page. If you do 
as I describe above, and then click on the 'describe table schema' 
button you get a page that says that the genes and gene predictions come 
from Ensembl, and they have a link to ensembl.org, where you can do more 
reading.

For the knownGene table, if you change to

track: UCSC Genes
table: knownGene

and then click 'describe table schema' there is a whole webpage 
describing how they generate those data.

Best,

Jim


>
> Thanks for helping.
> Sharvari
>
>
>
>
> On Tue, Jun 17, 2014 at 12:39 PM, James W. MacDonald <jmacdon at uw.edu> wrote:
>
>     Hi Sharvari,
>
>
>     On 6/17/2014 10:56 AM, Sharvari.Gujja at sanofi.com wrote:
>
>         Hi Steve,
>
>
>         I get the same error trying to run txdb <-
>         makeTranscriptDbFromUCSC(genome='hg19',tablename='knownGene')
>
>         Error in function (type, msg, asError = TRUE)  : couldn't
>         connect to host
>
>
>     This error means you are not able to connect to UCSC. This may be
>     due to an intermittent outage on their end, or possibly because you
>     are behind a firewall.
>
>     But note that if you want the knownGene transcript package, you can
>     get that from Bioconductor without having to build it yourself:
>
>     library(BiocInstaller)
>     biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")
>
>     If you want the ensGene table you will have to build that one
>     yourself. I just tried that using your code, and it works for me:
>
>
>      > txdb <- makeTranscriptDbFromUCSC(genome='hg19',tablename='ensGene')
>     Download the ensGene table ... OK
>     Extract the 'transcripts' data frame ... OK
>     Extract the 'splicings' data frame ... OK
>     Download and preprocess the 'chrominfo' data frame ... OK
>     Prepare the 'metadata' data frame ... OK
>     Make the TranscriptDb object ... OK
>     Warning message:
>     In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
>        UCSC data anomaly in 19284 transcript(s): the cds cumulative length is
>        not a multiple of 3 for transcripts ‘ENST00000513161’
>        ‘ENST00000417833’ ‘ENST00000450884’ ‘ENST00000431193’
>        ‘ENST00000367667’ ‘ENST00000498306’ ‘ENST00000434641’
>        ‘ENST00000462097’ ‘ENST00000475119’ ‘ENST00000480643’
>        ‘ENST00000525843’ ‘ENST00000498419’ ‘ENST00000532678’
>        ‘ENST00000460428’ ‘ENST00000478853’ ‘ENST00000372925’
>        ‘ENST00000437607’ ‘ENST00000416121’ ‘ENST00000582567’
>        ‘ENST00000413489’ ‘ENST00000425265’ ‘ENST00000534717’
>        ‘ENST00000436685’ ‘ENST00000606954’ ‘ENST00000484054’
>        ‘ENST00000414971’ ‘ENST00000443667’ ‘ENST00000417191’
>        ‘ENST00000559578’ ‘ENST00000482110’ ‘ENST00000524607’
>        ‘ENST00000419169’ ‘ENST00000295713’ ‘ENST00000609181’
>        ‘ENST00000327794’ ‘ENST00000450490’ ‘ENST00000602582’
>        ‘ENST00000453676’ ‘ENST00000513088’ ‘ENST [... truncated]
>      > txdb
>     TranscriptDb object:
>     | Db type: TranscriptDb
>     | Supporting package: GenomicFeatures
>     | Data source: UCSC
>     | Genome: hg19
>     | Organism: Homo sapiens
>     | UCSC Table: ensGene
>     | Resource URL: http://genome.ucsc.edu/
>     | Type of Gene ID: Ensembl gene ID
>     | Full dataset: yes
>     | miRBase build ID: NA
>     | transcript_nrow: 204940
>     | exon_nrow: 584914
>     | cds_nrow: 280379
>     | Db created by: GenomicFeatures package from Bioconductor
>     | Creation time: 2014-06-17 09:34:13 -0700 (Tue, 17 Jun 2014)
>     | GenomicFeatures version at creation time: 1.16.2
>     | RSQLite version at creation time: 0.11.4
>     | DBSCHEMAVERSION: 1.0
>
>     So you might try again. If you are on Windows, you might be having a
>     proxy issue, in which case you might use the setInternet2() function
>     prior to running makeTranscriptDbFromUCSC().
>
>     Best,
>
>     Jim
>
>
>
>
>
>
>
>         txdb <-
>         makeTranscriptDbFromUCSC(genome='hg19',tablename='ensGene')
>
>         Error in function (type, msg, asError = TRUE)  : couldn't
>         connect to host
>
>         I did install the required packages, so I am not sure what I am
>         missing here.
>
>         source("http://bioconductor.org/biocLite.R")
>         biocLite()
>         biocLite(c("GenomicFeatures", "AnnotationDbi"))
>         library("GenomicFeatures")
>
>         Could you please help me with this error.
>
>         Many Thanks
>         Sharvari Gujja
>
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org
>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>         Search the archives:
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>     --
>     James W. MacDonald, M.S.
>     Biostatistician
>     University of Washington
>     Environmental and Occupational Health Sciences
>     4225 Roosevelt Way NE, # 100
>     Seattle WA 98105-6099
>
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099


