[BioC] GenomicFeatures: makeTranscriptDbFromUCSC on "refGene" supported?

Vincent Carey stvjc at channing.harvard.edu
Wed Jul 14 15:30:26 CEST 2010


it seems to me that the speed issues are probably related to UCSC
server availability at the time of your session, which might depend on
external factors.

however, i can confirm a problem with the refGene request. in my case
the call did not just time out; it stopped with a parsing error.

first, i get a timing similar to yours for knownGene:

> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename
+ = "knownGene"))
   user  system elapsed
 65.585   0.675 200.614

then, for refGene (the failure at line 26744 is reproducible on every run):

> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename
+  = "refGene" )
+ )
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  line 26744 did not have 8 elements
Timing stopped at: 1.597 0.067 207.762
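
one way to narrow down an error like this is to count the fields on each
line of the file being parsed. the sketch below uses base R's count.fields()
on a small made-up tab-delimited file; the file contents and variable names
are hypothetical, and this is not necessarily how makeTranscriptDbFromUCSC
reads the table internally:

```r
# Hypothetical stand-in for a downloaded table: three tab-separated
# columns, with one deliberately truncated row (line 3).
lines <- c("name\tchrom\tstart",
           "NM_0001\tchr1\t100",
           "NM_0002\tchr2",
           "NM_0003\tchr3\t300")
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# count.fields() returns the number of fields on each line.
# Quoting and comment handling are switched off, and blank lines are
# kept, so the indices match the physical line numbers in the file.
n_fields <- count.fields(tmp, sep = "\t", quote = "",
                         comment.char = "", blank.lines.skip = FALSE)

# Treat the header's field count as the expected one and flag the rest.
expected  <- n_fields[1]
bad_lines <- which(n_fields != expected)

bad_lines                  # line numbers with an unexpected field count
readLines(tmp)[bad_lines]  # the raw content of those lines
```

applied to a saved copy of the refGene download, the same check should point
at the offending line, assuming the table really is tab-delimited.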

The GenomicFeatures developers will comment later.

> sessionInfo()
R version 2.12.0 Under development (unstable) (2010-06-30 r52417)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] GenomicFeatures_1.1.5 GenomicRanges_1.1.15  IRanges_1.7.9

loaded via a namespace (and not attached):
 [1] BSgenome_1.17.5    Biobase_2.9.0      Biostrings_2.17.12 DBI_0.2-5
 [5] RCurl_1.4-2        RSQLite_0.9-1      XML_3.1-0          biomaRt_2.5.1
 [9] rtracklayer_1.9.3  tools_2.12.0


On Wed, Jul 14, 2010 at 8:51 AM, Erik van den Akker
<erikvandenakker at gmail.com> wrote:
> Hi all,
>
> I'm a PhD student in bioinformatics working at the Leiden University Medical
> Center and at the Delft University of Technology in the Netherlands.
> Currently I'm working on the visualization of genome-wide data sources, such
> as linkage, GWAS & expression data.
> In order to be able to quickly access information on gene locations (along
> with the UTR, CDS, exons etc.), I thought it would be a good idea to make use
> of the GenomicFeatures package. This package works perfectly and very
> quickly for the example provided in the vignette (good job!):
>
>> library(GenomicFeatures)
>> system.time(mm9KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename =
> "knownGene"))
>   user  system elapsed
>  49.50    0.69  100.05
>
>> mm9KG
> TranscriptDb object:
> | Db type: TranscriptDb
> | Data source: UCSC
> | Genome: mm9
> | UCSC Table: knownGene
> | Type of Gene ID: Entrez Gene ID
> | Full dataset: yes
> | transcript_nrow: 49409
> | exon_nrow: 237551
> | cds_nrow: 204831
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2010-07-14 14:07:54 +0200 (Wed, 14 Jul 2010)
> | GenomicFeatures version at creation time: 1.0.3
> | RSQLite version at creation time: 0.9-1
>
>
> And even for a larger database (human), this works perfectly:
>
>> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename
> = "knownGene"))
>   user  system elapsed
>  82.09    1.11  162.53
>
>> hg19KG
> TranscriptDb object:
> | Db type: TranscriptDb
> | Data source: UCSC
> | Genome: hg19
> | UCSC Table: knownGene
> | Type of Gene ID: Entrez Gene ID
> | Full dataset: yes
> | transcript_nrow: 77614
> | exon_nrow: 281605
> | cds_nrow: 236664
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2010-07-14 14:11:03 +0200 (Wed, 14 Jul 2010)
> | GenomicFeatures version at creation time: 1.0.3
> | RSQLite version at creation time: 0.9-1
>
> However, for tablename = "refGene" I had to kill my R session after
> half an hour, both with genome = "mm9" and with genome = "hg19":
>
>> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename =
> "refGene"))
>
>> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename
> = "refGene"))
>
> As this package uses functionality provided by rtracklayer to fetch the
> table before the actual SQLite db is stored, I verified whether that step
> was working correctly:
>
>> library(rtracklayer)
>> session  <- browserSession()
>> genome(session) <- "hg19"
>> query <- ucscTableQuery(session,"refGene")
>> system.time(Table <- getTable(query))
>  user  system elapsed
>   7.70    0.39   61.73
>
> Typing "head(Table)" gave the expected results, suggesting that something
> is not working correctly in creating the SQLite databases.
>
> So, given that refGene shows up in supportedUCSCtables(), my questions are:
> 1) Did I do something wrong?
> 2) Should I just have more patience?
> 3) Could anyone confirm these problems?
> And @PackageMaintainers: if this is a genuine bug, are you planning to fix
> this or speed things up?
>
> As I work with gene expression data, which are commonly annotated to either
> RefSeq IDs or Ensembl transcript IDs, I would prefer to work with
> TranscriptDbs based on these features. Although I can think of many
> workarounds using "knownGene", I would prefer to work with the package as
> originally intended, hence this post.
>
> Thanks for the work already done on this great package!
>
> Cheerz,
>
> Erik van den Akker
>
>
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> i386-pc-mingw32
>
> locale:
> [1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252
> LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
> [5] LC_TIME=Dutch_Netherlands.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] rtracklayer_1.8.1     RCurl_1.4-2           bitops_1.0-4.1
> GenomicFeatures_1.0.3 GenomicRanges_1.0.5   IRanges_1.6.8
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0     biomaRt_2.4.0     Biostrings_2.16.7 BSgenome_1.16.5
> DBI_0.2-5         RSQLite_0.9-1     tools_2.11.1      XML_3.1-0
>


