[BioC] Working with ensembl 73 & BSGenome

Timothy Johnstone [guest] guest at bioconductor.org
Thu Oct 31 05:39:47 CET 2013

I'm working with the latest annotation set from Ensembl (release 73), which is based on the patched GRCh37.p12 assembly. I retrieved the transcript set from Ensembl BioMart using GenomicFeatures::makeTranscriptDbFromBiomart().

One of the things I'd like to do is create a DNAStringSet of sequences for all the transcripts in my TranscriptDb using the GenomicFeatures::extractTranscriptsFromGenome() function. This takes a TranscriptDb and a BSgenome object as input. However, the latest BSgenome package available for human is BSgenome.Hsapiens.UCSC.hg19, which is unpatched. When I run the command, I get the error:
Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], start[i],  : 
  sequence ^1$ not found

I'm pretty sure this is because the TranscriptDb contains sequences (patches/scaffolds) that are present in the patched assembly but not in the base GRCh37 assembly. Additionally, the sequence naming differs between UCSC and Ensembl (e.g. "chr1" vs. "1").
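
To make the two mismatches concrete, here is a minimal base-R sketch. It assumes that most Ensembl names map to UCSC names by prefixing "chr" (with "MT" becoming "chrM"). The helper ensembl_to_ucsc() and both name vectors are hypothetical stand-ins, not part of any package; in a real session the vectors would come from seqlevels() on the TranscriptDb and seqnames() on the BSgenome object, and the patch name below is made up.

```r
# Hypothetical helper: translate Ensembl-style sequence names to UCSC style.
# Assumption: prefixing "chr" is enough, except for the mitochondrion.
ensembl_to_ucsc <- function(ensembl_names) {
  ucsc <- paste0("chr", ensembl_names)
  ucsc[ensembl_names == "MT"] <- "chrM"
  ucsc
}

# Illustrative stand-ins for the real sequence names (not actual data).
txdb_names     <- c("1", "2", "X", "MT", "HG_EXAMPLE_PATCH")
bsgenome_names <- c("chr1", "chr2", "chrX", "chrM")

translated <- ensembl_to_ucsc(txdb_names)

# TRUE where the translated name exists in the (unpatched) genome.
usable <- translated %in% bsgenome_names

txdb_names[usable]    # sequences whose transcripts could be extracted
txdb_names[!usable]   # patch/scaffold sequences with no counterpart
```

Under these assumptions, only transcripts on the usable sequences could be extracted from hg19 after renaming; anything on a patch or scaffold would still fail. As far as I know, hg19's chrM is also not the same mitochondrial sequence as Ensembl's MT, so renaming alone may not be safe for that one.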

I see a few options here. One obvious one would be to stick with UCSC hg19 and use the UCSC ensGene table, but others in my working group are using Ensembl 73, so this is a suboptimal solution. Is there an updated BSgenome package available for GRCh37.p12, or an easy way to forge one? Have others encountered this issue?

Tim Johnstone

 -- output of sessionInfo(): 

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grDevices datasets  splines   tcltk     utils     parallel  stats     graphics  methods  
[10] base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.3.19  BiocInstaller_1.12.0               
 [3] data.table_1.8.10                   Hmisc_3.12-2                       
 [5] Formula_1.1-1                       survival_2.37-4                    
 [7] plyr_1.8                            gdata_2.13.2                       
 [9] ShortRead_1.20.0                    lattice_0.20-24                    
[11] rtracklayer_1.22.0                  Rsamtools_1.14.1                   
[13] BSgenome.Drerio.UCSC.danRer7_1.3.17 BSgenome_1.30.0                    
[15] Biostrings_2.30.0                   lessR_2.9.7                        
[17] GenomicFeatures_1.14.0              AnnotationDbi_1.24.0               
[19] Biobase_2.22.0                      GenomicRanges_1.14.3               
[21] XVector_0.2.0                       IRanges_1.20.4                     
[23] BiocGenerics_0.8.0                 

loaded via a namespace (and not attached):
 [1] biomaRt_2.18.0      bitops_1.0-6        car_2.0-19          cluster_1.14.4     
 [5] DBI_0.2-7           foreign_0.8-57      grid_3.0.2          gtools_3.1.0       
 [9] hwriter_1.3         latticeExtra_0.6-26 leaps_2.9           MASS_7.3-29        
[13] MBESS_3.3.3         nnet_7.3-7          RColorBrewer_1.0-5  RCurl_1.95-4.1     
[17] rpart_4.1-3         RSQLite_0.11.4      stats4_3.0.2        tools_3.0.2        
[21] XML_3.95-0.2        zlibbioc_1.8.0 

Sent via the guest posting facility at bioconductor.org.
