[BioC] GenomicFeatures: makeTranscriptDbFromUCSC on "refGene" supported?

Hervé Pagès hpages at fhcrc.org
Thu Jul 15 09:01:22 CEST 2010


This is fixed in GenomicFeatures release (1.0.5) and devel
(1.1.6). Both should become available via biocLite() in the
next 24 or 36 hours.

Thanks again for the report.

Cheers,
H.

On 07/14/2010 02:18 PM, Erik van den Akker wrote:
>
> Hi Vince & Hervé,
>
> I'm afraid i've run into another (small) problem with using the
> GenomicFeatures package:
>
> For the sake of clarity, I've  used the example as described in the
> vignette again for making a
> TranscriptDb object:
>
>  > library(GenomicFeatures)
>  > mm9KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename =
> "knownGene")
>  > GR <- transcripts(mm9KG)
>  > GR[1]
> GRanges with 1 range and 2 elementMetadata values
>      seqnames             ranges strand |     tx_id     tx_name
> <Rle> <IRanges> <Rle> | <integer> <character>
> [1]     chr1 [4797974, 4832908]      + |        14 *uc007afg.1*
>
> seqlengths
>           chr1         chr2         chr3         chr4
> chr5         chr6 ...  chr7_random  chr8_random  chr9_random
> chrUn_random  chrX_random  chrY_random
>      197195432    181748087    159599783    155630120    152537259
> 149517037 ...       362490       849593       449403      5900358
> 1785075     58682461
>
> Next I wanted to selectively extract features from the database based on
> the criteria which can be
> specified with the "vals" argument:
>
>  > transcripts(mm9KG,vals=list(tx_name="*uc007afg.1*"))
>
> which returns the same information as GR[1] as expected. By setting the
> "columns" argument to
> c("tx_id","tx_name","exon_id","gene_id"), I found that "18777" was the
> associated gene_id:
>
>  >
> transcripts(mm9KG,vals=list(tx_name="*uc007afg.1*"),columns=c("tx_id","tx_name","gene_id"))
> GRanges with 1 range and 3 elementMetadata values
>      seqnames             ranges strand |     tx_id
> tx_name                   gene_id
> <Rle> <IRanges> <Rle> | <integer> <character> <CompressedCharacterList>
> [1]     chr1 [4797974, 4832908]      + |        14 *uc007afg.1* *18777*
>
> seqlengths
>           chr1         chr2         chr3         chr4
> chr5         chr6 ...  chr7_random  chr8_random  chr9_random
> chrUn_random  chrX_random  chrY_random
>      197195432    181748087    159599783    155630120    152537259
> 149517037 ...       362490       849593       449403      5900358
> 1785075     58682461
>
> Besides tx_name, subselections can be specified by tx_id, tx_chrom,
> tx_name, tx_strand & gene_id.
> All these worked fine except for the latter, which gave the following error:
>
>  >
> transcripts(mm9KG,vals=list(gene_id="*18777*"),columns=c("tx_id","tx_name","gene_id"))
> Error in sqliteExecStatement(con, statement, bind.data) :
>    RS-DBI driver: (error in statement: no such column: gene_id)
>
> As gene_id is among ValidValNames (as specified in the function
> transcripts) and all other
> criteria gave the expected results, I'm affraid this is a bug. I've
> noticed this was the only
> one selecting in a <CompressedCharacterList>, so this might perhaps be
> the reason.
>
> Except for these minor issues, I do think this package is very relevant
> and I see many applications for my
> current work, therefore, thanks again,
>
> Cheerz,
>
> Erik van den Akker
>
>  > sessionInfo()
> R version 2.11.1 (2010-05-31)
> i386-pc-mingw32
>
> locale:
> [1] LC_COLLATE=Dutch_Netherlands.1252
> LC_CTYPE=Dutch_Netherlands.1252    LC_MONETARY=Dutch_Netherlands.1252
> LC_NUMERIC=C
> [5] LC_TIME=Dutch_Netherlands.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] GenomicFeatures_1.0.3 GenomicRanges_1.0.5   IRanges_1.6.8
>
> loaded via a namespace (and not attached):
>   [1] Biobase_2.8.0     biomaRt_2.4.0     Biostrings_2.16.7
> BSgenome_1.16.5   DBI_0.2-5         RCurl_1.4-2       RSQLite_0.9-1
> rtracklayer_1.8.1 tools_2.11.1
> [10] XML_3.1-0
>
>
>
> 2010/7/14 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>     Hi Erik, Vince,
>
>     I'm puzzled by this. Populating the db is made using prepared
>     statements which are usually very fast. I'm investigating and
>     will let you know. Thanks for the report.
>
>     H.
>
>
>
>     On 07/14/2010 06:30 AM, Vincent Carey wrote:
>
>         it seems to me that the speed issues are probably related to UCSC
>         server availability at the time of your session, which might
>         depend on
>         external factors.
>
>         however i can confirm a problem with the refGene request.   I got
>         further than a timeout
>
>         first i get a timing similar to yours for knownGene--
>
>             system.time(hg19KG<- makeTranscriptDbFromUCSC(genome =
>             "hg19", tablename
>
>         + = "knownGene"))
>             user  system elapsed
>           65.585   0.675 200.614
>
>         then (and this event at line 26744 is literally reproducible)
>
>             system.time(hg19KG<- makeTranscriptDbFromUCSC(genome =
>             "hg19", tablename
>
>         +  = "refGene" )
>         + )
>         Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
>         na.strings,  :
>            line 26744 did not have 8 elements
>         Timing stopped at: 1.597 0.067 207.762
>
>         The GenomicFeatures developers will comment later.
>
>             sessionInfo()
>
>         R version 2.12.0 Under development (unstable) (2010-06-30 r52417)
>         Platform: x86_64-unknown-linux-gnu (64-bit)
>
>         locale:
>         [1] C
>
>         attached base packages:
>         [1] stats     graphics  grDevices datasets  utils     methods   base
>
>         other attached packages:
>         [1] GenomicFeatures_1.1.5 GenomicRanges_1.1.15  IRanges_1.7.9
>
>         loaded via a namespace (and not attached):
>           [1] BSgenome_1.17.5    Biobase_2.9.0      Biostrings_2.17.12
>         DBI_0.2-5
>           [5] RCurl_1.4-2        RSQLite_0.9-1      XML_3.1-0
>           biomaRt_2.5.1
>           [9] rtracklayer_1.9.3  tools_2.12.0
>
>
>         On Wed, Jul 14, 2010 at 8:51 AM, Erik van den Akker
>         <erikvandenakker at gmail.com <mailto:erikvandenakker at gmail.com>>
>           wrote:
>
>             Hi all,
>
>             I'm a PhD student in bioinformatics working at the Leiden
>             University Medical
>
>             Center and at the Delft University of Technlogy in the
>             Netherlands.
>             Currently
>             I'm working on the vizualization of genome wide data
>             sources, such as
>             Linkage,
>             GWAS&  Expression data.
>             In order to be able to quickely access information on gene
>             locations (along
>             with the UTR, CDS, exons etc), I thought it would be a good
>             idea to make use
>
>             of the GenomicFeatures package. This package works perfectly
>             and very
>             quickely
>             for the example provided in the vignette (good job!):
>
>                 library(GenomicFeatures)
>                 system.time(mm9KG<- makeTranscriptDbFromUCSC(genome =
>                 "mm9", tablename =
>
>             "knownGene"))
>                user  system elapsed
>               49.50    0.69  100.05
>
>                 mm9KG
>
>             TranscriptDb object:
>             | Db type: TranscriptDb
>             | Data source: UCSC
>             | Genome: mm9
>             | UCSC Table: knownGene
>             | Type of Gene ID: Entrez Gene ID
>             | Full dataset: yes
>             | transcript_nrow: 49409
>             | exon_nrow: 237551
>             | cds_nrow: 204831
>             | Db created by: GenomicFeatures package from Bioconductor
>             | Creation time: 2010-07-14 14:07:54 +0200 (Wed, 14 Jul 2010)
>             | GenomicFeatures version at creation time: 1.0.3
>             | RSQLite version at creation time: 0.9-1
>
>
>             And even for larger databases(humans), this works perfectly:
>
>                 system.time(hg19KG<- makeTranscriptDbFromUCSC(genome =
>                 "hg19", tablename
>
>             = "knownGene"))
>                user  system elapsed
>               82.09    1.11  162.53
>
>                 hg19KG
>
>             TranscriptDb object:
>             | Db type: TranscriptDb
>             | Data source: UCSC
>             | Genome: hg19
>             | UCSC Table: knownGene
>             | Type of Gene ID: Entrez Gene ID
>             | Full dataset: yes
>             | transcript_nrow: 77614
>             | exon_nrow: 281605
>             | cds_nrow: 236664
>             | Db created by: GenomicFeatures package from Bioconductor
>             | Creation time: 2010-07-14 14:11:03 +0200 (Wed, 14 Jul 2010)
>             | GenomicFeatures version at creation time: 1.0.3
>             | RSQLite version at creation time: 0.9-1
>
>             However, for tablename = "refGene" I had to shoot down my R
>             session after
>             half an hour for both the settings genome = "mm9"&  genome =
>             "hg19"
>
>                 system.time(hg19KG<- makeTranscriptDbFromUCSC(genome =
>                 "mm9", tablename =
>
>             "refGene"))
>
>                 system.time(hg19KG<- makeTranscriptDbFromUCSC(genome =
>                 "hg19", tablename
>
>             = "refGene"))
>
>             As this package makes use of functionalities provided by
>             rtracklayer, before
>
>             the actual SQLite db is stored, I verified whether this was
>             working
>             correctly:
>
>                 library(rtracklayer)
>                 session<- browserSession()
>                 genome(session)<- "hg19"
>                 query<- ucscTableQuery(session,"refGene")
>                 system.time(Table<- getTable(query))
>
>               user  system elapsed
>                7.70    0.39   61.73
>
>             Typing "head(Table)" gave the expected results, suggesting
>             that something
>             is not working correctly in creating the SQLite databases.
>
>             So, my question:
>             Given that refGene pops up when using supportedUCSCtables(),
>             I wondered:
>             1) Did I do something wrong?; 2) should I just have more
>             patience&  3) could
>             anyone
>             confirm these problems?
>             And
>             @PackageMaintainers: If this is a genuine bug, are you
>             planning to fix this
>             or speed things up?
>
>             As I work with gene expression data, which are commonly
>             annotated to either
>             RefSeqIDs or Ensembl Transcript IDs, I would prefer to work with
>             TranscriptDBs
>             based on these features. Although I can think of many work
>             around solutions
>             using "knownGene" I would prefer to work with the package as
>             originally
>             intended
>             and hence this post.
>
>             Thanks for the work already done on this great package!
>
>             Cheerz,
>
>             Erik van den Akker
>
>
>                 sessionInfo()
>
>             R version 2.11.1 (2010-05-31)
>             i386-pc-mingw32
>
>             locale:
>             [1] LC_COLLATE=Dutch_Netherlands.1252
>               LC_CTYPE=Dutch_Netherlands.1252
>             LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
>             [5] LC_TIME=Dutch_Netherlands.1252
>
>             attached base packages:
>             [1] stats     graphics  grDevices utils     datasets
>               methods   base
>
>             other attached packages:
>             [1] rtracklayer_1.8.1     RCurl_1.4-2           bitops_1.0-4.1
>             GenomicFeatures_1.0.3 GenomicRanges_1.0.5   IRanges_1.6.8
>
>             loaded via a namespace (and not attached):
>             [1] Biobase_2.8.0     biomaRt_2.4.0     Biostrings_2.16.7
>             BSgenome_1.16.5
>             DBI_0.2-5         RSQLite_0.9-1     tools_2.11.1      XML_3.1-0
>
>                     [[alternative HTML version deleted]]
>
>             _______________________________________________
>             Bioconductor mailing list
>             Bioconductor at stat.math.ethz.ch
>             <mailto:Bioconductor at stat.math.ethz.ch>
>             https://stat.ethz.ch/mailman/listinfo/bioconductor
>             Search the archives:
>             http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>         _______________________________________________
>         Bioconductor mailing list
>         Bioconductor at stat.math.ethz.ch
>         <mailto:Bioconductor at stat.math.ethz.ch>
>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>         Search the archives:
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M2-B876
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone:  (206) 667-5791
>     Fax:    (206) 667-1319
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list