[BioC] GenomicFeatures: makeTranscriptDbFromUCSC on "refGene" supported?

Hervé Pagès hpages at fhcrc.org
Wed Jul 14 23:06:12 CEST 2010


Hi Erik, Vince,

I'm puzzled by this. The db is populated using prepared statements,
which are usually very fast. I'm investigating and will let you know.
Thanks for the report.
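
For context, here is a minimal sketch of what bulk loading through a
prepared statement looks like. It assumes the current DBI/RSQLite interface
(dbExecute() with a params argument), which is newer than the package
versions quoted in this thread. The table and column names are hypothetical
and are not the actual TranscriptDb schema.

```r
library(DBI)

# In-memory SQLite database with a hypothetical toy table
# (NOT the real TranscriptDb schema).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE transcript (
                  tx_id    INTEGER PRIMARY KEY,
                  tx_name  TEXT,
                  tx_chrom TEXT)")

# Made-up rows standing in for a downloaded UCSC table.
tx <- data.frame(
  tx_id    = 1:3,
  tx_name  = c("NM_A", "NM_B", "NM_C"),
  tx_chrom = c("chr1", "chr1", "chr2"),
  stringsAsFactors = FALSE
)

# One parameterised INSERT, bound to all rows, inside one transaction.
dbBegin(con)
dbExecute(con,
          "INSERT INTO transcript (tx_id, tx_name, tx_chrom)
           VALUES (:tx_id, :tx_name, :tx_chrom)",
          params = as.list(tx))
dbCommit(con)

# Check how many rows ended up in the table.
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM transcript")$n
dbDisconnect(con)
```

The statement is prepared once and then executed for each set of bound
values, which is why this kind of loading is normally quick.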

H.
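
Regarding the scan() error Vince reports below ("line 26744 did not have 8
elements"): one way to look at such a problem, assuming the downloaded table
has been saved to a tab-delimited file, is to count the fields on every line
with base R's count.fields(). The sketch below uses a small made-up file with
hypothetical column names. It only illustrates the check and says nothing
about the actual cause of the refGene failure.

```r
# Write a small tab-delimited file in which one record is short,
# mimicking a line that does not have the expected 8 fields.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "name\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\texonCount",
  "NM_A\tchr1\t+\t100\t900\t150\t850\t3",
  "NM_B\tchr1\t-\t2000\t2600",
  "NM_C\tchr2\t+\t50\t700\t80\t650\t2"
), tmp)

# Count fields per line; disable quote and comment handling so that
# characters such as ' or # inside a field do not change the count.
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")

# Take the header line as the reference and report lines that differ.
expected  <- n_fields[1]
bad_lines <- which(n_fields != expected)

bad_lines                   # line numbers with an unexpected field count
readLines(tmp)[bad_lines]   # the offending records themselves
```

Inspecting the reported line this way shows whether the record is truncated
or contains an unexpected character.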


On 07/14/2010 06:30 AM, Vincent Carey wrote:
> it seems to me that the speed issues are probably related to UCSC
> server availability at the time of your session, which might depend on
> external factors.
>
> however i can confirm a problem with the refGene request.   I got
> further than a timeout
>
> first i get a timing similar to yours for knownGene--
>
>> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = "hg19", tablename
> + = "knownGene"))
>     user  system elapsed
>   65.585   0.675 200.614
>
> then (and this event at line 26744 is literally reproducible)
>
>> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = "hg19", tablename
> +  = "refGene" )
> + )
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>    line 26744 did not have 8 elements
> Timing stopped at: 1.597 0.067 207.762
>
> The GenomicFeatures developers will comment later.
>
>> sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-06-30 r52417)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
>
> other attached packages:
> [1] GenomicFeatures_1.1.5 GenomicRanges_1.1.15  IRanges_1.7.9
>
> loaded via a namespace (and not attached):
>   [1] BSgenome_1.17.5    Biobase_2.9.0      Biostrings_2.17.12 DBI_0.2-5
>   [5] RCurl_1.4-2        RSQLite_0.9-1      XML_3.1-0          biomaRt_2.5.1
>   [9] rtracklayer_1.9.3  tools_2.12.0
>
>
> On Wed, Jul 14, 2010 at 8:51 AM, Erik van den Akker
> <erikvandenakker at gmail.com>  wrote:
>> Hi all,
>>
>> I'm a PhD student in bioinformatics working at the Leiden University Medical
>> Center and at the Delft University of Technology in the Netherlands.
>> Currently I'm working on the visualization of genome-wide data sources,
>> such as linkage, GWAS & expression data.
>> In order to be able to quickly access information on gene locations (along
>> with the UTR, CDS, exons etc.), I thought it would be a good idea to make use
>> of the GenomicFeatures package. This package works perfectly and very
>> quickly for the example provided in the vignette (good job!):
>>
>>> library(GenomicFeatures)
>>> system.time(mm9KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "knownGene"))
>>    user  system elapsed
>>   49.50    0.69  100.05
>>
>>> mm9KG
>> TranscriptDb object:
>> | Db type: TranscriptDb
>> | Data source: UCSC
>> | Genome: mm9
>> | UCSC Table: knownGene
>> | Type of Gene ID: Entrez Gene ID
>> | Full dataset: yes
>> | transcript_nrow: 49409
>> | exon_nrow: 237551
>> | cds_nrow: 204831
>> | Db created by: GenomicFeatures package from Bioconductor
>> | Creation time: 2010-07-14 14:07:54 +0200 (Wed, 14 Jul 2010)
>> | GenomicFeatures version at creation time: 1.0.3
>> | RSQLite version at creation time: 0.9-1
>>
>>
>> And even for larger databases (human), this works perfectly:
>>
>>> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "knownGene"))
>>    user  system elapsed
>>   82.09    1.11  162.53
>>
>>> hg19KG
>> TranscriptDb object:
>> | Db type: TranscriptDb
>> | Data source: UCSC
>> | Genome: hg19
>> | UCSC Table: knownGene
>> | Type of Gene ID: Entrez Gene ID
>> | Full dataset: yes
>> | transcript_nrow: 77614
>> | exon_nrow: 281605
>> | cds_nrow: 236664
>> | Db created by: GenomicFeatures package from Bioconductor
>> | Creation time: 2010-07-14 14:11:03 +0200 (Wed, 14 Jul 2010)
>> | GenomicFeatures version at creation time: 1.0.3
>> | RSQLite version at creation time: 0.9-1
>>
>> However, for tablename = "refGene" I had to kill my R session after
>> half an hour, for both genome = "mm9" and genome = "hg19":
>>
>>> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "refGene"))
>>
>>> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "refGene"))
>>
>> As this package uses functionality provided by rtracklayer before the
>> actual SQLite db is stored, I verified whether that part was working
>> correctly:
>>
>>> library(rtracklayer)
>>> session<- browserSession()
>>> genome(session)<- "hg19"
>>> query<- ucscTableQuery(session,"refGene")
>>> system.time(Table<- getTable(query))
>>   user  system elapsed
>>    7.70    0.39   61.73
>>
>> Typing "head(Table)" gave the expected results, suggesting that something
>> is not working correctly in creating the SQLite databases.
>>
>> So, my question: given that refGene pops up when using
>> supportedUCSCtables(), I wondered:
>> 1) Did I do something wrong? 2) Should I just have more patience? 3) Could
>> anyone confirm these problems?
>> And @PackageMaintainers: if this is a genuine bug, are you planning to fix
>> this or speed things up?
>>
>> As I work with gene expression data, which are commonly annotated to either
>> RefSeq IDs or Ensembl transcript IDs, I would prefer to work with
>> TranscriptDbs based on these features. Although I can think of many
>> workarounds using "knownGene", I would prefer to work with the package as
>> originally intended, hence this post.
>>
>> Thanks for the work already done on this great package!
>>
>> Cheerz,
>>
>> Erik van den Akker
>>
>>
>>> sessionInfo()
>> R version 2.11.1 (2010-05-31)
>> i386-pc-mingw32
>>
>> locale:
>> [1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252
>> LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
>> [5] LC_TIME=Dutch_Netherlands.1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] rtracklayer_1.8.1     RCurl_1.4-2           bitops_1.0-4.1
>> GenomicFeatures_1.0.3 GenomicRanges_1.0.5   IRanges_1.6.8
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.8.0     biomaRt_2.4.0     Biostrings_2.16.7 BSgenome_1.16.5
>> DBI_0.2-5         RSQLite_0.9-1     tools_2.11.1      XML_3.1-0
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


