[BioC] Working with ensembl 73 & BSGenome

Thu Oct 31 07:54:48 CET 2013

Hi Tim,

On 10/30/2013 09:39 PM, Timothy Johnstone [guest] wrote:
> Hi,
> I'm working with the latest annotation set from Ensembl (ens73) which is based on the patched GRCh37.p12 assembly. I have retrieved the transcript set from Ensembl biomart using GenomicFeatures:makeTranscriptDbFromBiomart().
>
> One of the things I'd like to do is create a DNAStringSet of sequences for all the transcripts in my transcriptDB using the GenomicFeatures:extractTranscriptsFromGenome() function. This takes a TDB and a BSGenomes object as input. However, the latest BSGenomes available for the human is UCSC.hg19, which is unpatched. When I run the command, I get the error:
> Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], start[i],  :
>    sequence ^1$ not found
>
> I'm pretty sure this is because the transcriptDB contains sequences (patches/scaffolds) that are present in the patched assembly but not the base GRCh37 assembly. Additionally the nomenclature is different between UCSC and Ensembl (e.g. chr1 ; 1).

Based on the GenBank ID (or RefSeq ID) of the 24 chromosomes listed
here:

   http://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.1/
   http://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.13/

the GRCh37 primary assembly (i.e. the exact sequence of the 24
chromosomes) has not changed between GRCh37 and GRCh37.p12.

Also according to the note here:

   http://genome.ucsc.edu/cgi-bin/hgGateway?clade=mammal&org=Human&db=hg19

chrM is not the same in GRCh37 as in the UCSC hg19 assembly.

So if you want to use Ensembl 73 (based on GRCh37.p12) to extract the
transcript sequences from a BSgenome package, and if restricting this
extraction to the 24 chromosomes is OK with you, then you can use the
BSgenome.Hsapiens.UCSC.hg19 package. But since Ensembl and UCSC use
different chromosome naming conventions, you will have to adjust the
chromosome names. Here is how to proceed:

   library(GenomicFeatures)
   txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", 
dataset="hsapiens_gene_ensembl")
   ex_by_tx <- exonsBy(txdb, by="tx")
   seqlevels(ex_by_tx, force=TRUE) <- c(1:22, "X", "Y")  # restrict
   seqlevels(ex_by_tx) <- paste0("chr", seqlevels(ex_by_tx))  # rename

   library(BSgenome.Hsapiens.UCSC.hg19)
   tx_seqs <- extractTranscriptsFromGenome(Hsapiens, ex_by_tx)  # extract

If you want to extract all the transcripts reported in Ensembl 73 (i.e.
not only the 195381 transcripts located on the 24 chromosomes but also
the 18913 transcripts located on the alt loci and unplaced contigs),
then you'll need to forge a BSGenome data package for GRCh37.p12. Look
at the BSgenomeForge vignette in the BSgenome software package for how
to do this. It looks like one extra difficulty with GRCh37.p12 (and
previous GRCh37 patches) is that there doesn't seem to be an easy way
to download the complete assembly from NCBI as FASTA file(s). Last
time I checked (was around GRCh37.p7), it seemed that only the diff
information was available in some kind of specialized file format so
my impression was that one had to download the diff and then use it
to "patch" the GRCh37 sequences in order to produce the GRCh37.p12
FASTA sequences. All this didn't seem completely straightforward to me
but I didn't really spend enough time to know.

Ideally, we should improve the BSgenome infrastructure so that
we can also follow that diff model i.e. it would be great if we were
able to forge light-weight BSgenome data packages that only contain
the diff against another BSgenome data package. For example
BSgenome.Hsapiens.GRCh37.p12 could be forged as a light-weight
package that only contains the diff against (and depends on)
BSgenome.Hsapiens.GRCh37, which would be a heavy-weight package
(i.e. it contains the full sequences). From a user point of view
there would be no difference between light-weight and heavy-weight,
that is, both would contain a BSgenome object that looks like it
contains a full assembly. But from a data management point of view
that would make a big difference as we would then be able to keep
up with the pace of GRCh37 patches by providing one BSgenome data
package for each of them at a low cost.

Cheers,
H.

>
> I see a few options here. One obvious one would be to stick with UCSC hg19 and use the UCSC ensGene table, but others in my working group are using ens73 so this is a suboptimal solution. Is there an updated BSGenome available for GRCh37.p12, or an easy way to forge one? Have others encountered this issue?
>
> Thanks!
> Tim Johnstone
>
>   -- output of sessionInfo():
>
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
>   [1] grDevices datasets  splines   tcltk     utils     parallel  stats     graphics  methods
> [10] base
>
> other attached packages:
>   [1] BSgenome.Hsapiens.UCSC.hg19_1.3.19  BiocInstaller_1.12.0
>   [3] data.table_1.8.10                   Hmisc_3.12-2
>   [5] Formula_1.1-1                       survival_2.37-4
>   [7] plyr_1.8                            gdata_2.13.2
>   [9] ShortRead_1.20.0                    lattice_0.20-24
> [11] rtracklayer_1.22.0                  Rsamtools_1.14.1
> [13] BSgenome.Drerio.UCSC.danRer7_1.3.17 BSgenome_1.30.0
> [15] Biostrings_2.30.0                   lessR_2.9.7
> [17] GenomicFeatures_1.14.0              AnnotationDbi_1.24.0
> [19] Biobase_2.22.0                      GenomicRanges_1.14.3
> [21] XVector_0.2.0                       IRanges_1.20.4
> [23] BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
>   [1] biomaRt_2.18.0      bitops_1.0-6        car_2.0-19          cluster_1.14.4
>   [5] DBI_0.2-7           foreign_0.8-57      grid_3.0.2          gtools_3.1.0
>   [9] hwriter_1.3         latticeExtra_0.6-26 leaps_2.9           MASS_7.3-29
> [13] MBESS_3.3.3         nnet_7.3-7          RColorBrewer_1.0-5  RCurl_1.95-4.1
> [17] rpart_4.1-3         RSQLite_0.11.4      stats4_3.0.2        tools_3.0.2
> [21] XML_3.95-0.2        zlibbioc_1.8.0
>
> --
> Sent via the guest posting facility at bioconductor.org.
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319