[BioC] Working with ensembl 73 & BSGenome

Hervé Pagès hpages at fhcrc.org
Thu Oct 31 23:37:56 CET 2013

On 10/31/2013 01:33 AM, Martin Morgan wrote:
> On 10/30/2013 11:54 PM, Hervé Pagès wrote:
>> Hi Tim,
>> On 10/30/2013 09:39 PM, Timothy Johnstone [guest] wrote:
>>> Hi,
>>> I'm working with the latest annotation set from Ensembl (ens73) which
>>> is based
>>> on the patched GRCh37.p12 assembly. I have retrieved the transcript
>>> set from
>>> Ensembl biomart using GenomicFeatures:makeTranscriptDbFromBiomart().
>>> One of the things I'd like to do is create a DNAStringSet of
>>> sequences for all
>>> the transcripts in my transcriptDB using the
>>> GenomicFeatures:extractTranscriptsFromGenome() function. This takes a
>>> TDB and
>>> a BSGenomes object as input. However, the latest BSGenomes available
>>> for the
>>> human is UCSC.hg19, which is unpatched. When I run the command, I get
>>> the error:
>>> Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i],
>>> start[i],  :
>>>    sequence ^1$ not found
>>> I'm pretty sure this is because the transcriptDB contains sequences
>>> (patches/scaffolds) that are present in the patched assembly but not
>>> the base
>>> GRCh37 assembly. Additionally the nomenclature is different between
>>> UCSC and
>>> Ensembl (e.g. chr1 ; 1).
>> Based on the GenBank ID (or RefSeq ID) of the 24 chromosomes listed
>> here:
>>    http://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.1/
>>    http://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.13/
>> the GRCh37 primary assembly (i.e. the exact sequence of the 24
>> chromosomes) has not changed between GRCh37 and GRCh37.p12.
>> Also according to the note here:
>> http://genome.ucsc.edu/cgi-bin/hgGateway?clade=mammal&org=Human&db=hg19
>> chrM is not the same in GRCh37 as in the UCSC hg19 assembly.
>> So if you want to use Ensembl 73 (based on GRCh37.p12) to extract the
>> transcript sequences from a BSgenome package, and if restricting this
>> extraction to the 24 chromosomes is OK with you, then you can use the
>> BSgenome.Hsapiens.UCSC.hg19 package. But since Ensembl and UCSC use
>> different chromosome naming conventions, you will have to adjust the
>> chromosome names. Here is how to proceed:
>>    library(GenomicFeatures)
>>    txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
>> dataset="hsapiens_gene_ensembl")
>>    ex_by_tx <- exonsBy(txdb, by="tx")
>>    seqlevels(ex_by_tx, force=TRUE) <- c(1:22, "X", "Y")  # restrict
>>    seqlevels(ex_by_tx) <- paste0("chr", seqlevels(ex_by_tx))  # rename
>>    library(BSgenome.Hsapiens.UCSC.hg19)
>>    tx_seqs <- extractTranscriptsFromGenome(Hsapiens, ex_by_tx)  # extract
> Herve -- is there a way to use the fasta file from Ensembl? Download,
> compress and index it using Rsamtools::razip and Rsamtools::indexFa. The
> data source is
>    http://www.ensembl.org/info/data/ftp/index.html
> Then use FaFile to reference the file and getSeq,FaFile-method to get
> sequences.
>    library(Rsamtools)
>    fa = FaFile("/path/to/GRCh37.72.dna.toplevel.fa.rz")
> and then I think
>    library(GenomicFeatures)
>    dnaList = lapply(seqlevels(ex_by_tx), function(lvl) {
>         ## getSeq on the FaFile
>         ## extractTranscripts with seq and ex_by_tx
>    }
>    ## munge to dnaList to DNAStringSet

Thanks Martin for the pointer to the Ensembl FASTA files.

Sure the "getSeq + extractTranscripts" road is worth exploring.
Maybe a simpler road though is the "scanFa + extractTranscripts" road.
With the latter, the main loop loads the full DNA sequences one at a
time from the FaFile object and calls extractTranscripts() on it.
The advanced capabilities of getSeq() are not needed for this and
lower level scanFa() is enough.

The "getSeq + extractTranscripts" road could try to minimize I/O
cost and memory usage by loading only the transcript regions from
each sequence. This is appealing but could make the code a little
bit more complicated. Ideally, it sounds like this complexity should
be handled by something like an "extractTranscripts" method for
FaFile objects (and extractTranscriptsFromGenome() should become
the "extractTranscripts" method for BSgenome objects) but we don't
have this yet.

So only exploring the simpler "scanFa + extractTranscripts" road
for now. Some bumps I ran into while exploring this road:

First it's hard to know which Ensembl FASTA file is most appropriate.
For example with the Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz file
(seems like a good candidate) I ran into the following gotchas:

   - The file contains only 287 sequences while the 'txdb' object
     obtained with makeTranscriptDbFromBiomart() reports transcripts
     on 648 distinct sequences. After some investigation I found that
     393 of these sequences are LRG sequences which are not considered
     "top-level" and thus not included in the
     Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz file.

   - One surprise is that even though the FASTA file uses the same
     sequence names as 'txdb', those names are just the prefixes of
     long and complicated record headers. For example, for chromosome
     1, 2, and patch HG1007_PATCH:

       > 1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 REF
       > 2 dna:chromosome chromosome:GRCh37:2:1:243199373:1 REF
       > HG1007_PATCH dna:chromosome 
chromosome:GRCh37:HG1007_PATCH:243059660:243125680:1 PATCH_FIX

     However this was just a surprise but not an issue at all when
     accessing this file thru a FaFile object:

       file_path <- "Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz"
       indexFa(file_path) # the file is huge (1GB compressed, 30GB 
                          # so creating the index takes a while
       fafile <- FaFile(file_path)


       > seqlengths(fafile)[c(1, 12, 23)]
               1            2 HG1007_PATCH
       249250621    243199373    243135680

     Everything looks good :-)

   - Finally, a show stopper is the following error I get while trying
     to load chromosome 6 or any part of it or anything located after
     chromosome 6 in the file:

       > scanFa(fafile, param=GRanges("6", IRanges(1, 20)))
       Error in value[[3L]](cond) : key 62 (char '>') not in lookup table
         file: Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz

       > scanFa(fafile, param=GRanges("7", IRanges(1, 20)))
       Error in value[[3L]](cond) : key 62 (char '>') not in lookup table
         file: Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz

     Loading the chromosomes located before chromosome 6 (e.g. chrom 1-5
     and 10-22) seems to work.

Anyway, here is what the "scanFa + extractTranscripts" road looks like:

   file_path <- "Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz"
   fafile <- FaFile(file_path)
   ex_by_tx <- exonsBy(txdb, by="tx", use.names=TRUE)

   shared_seqlevels <- intersect(seqlevels(fafile), seqlevels(ex_by_tx))

   ## The loop will work fine until it chokes on chromosome 6.
   txseqs_per_seqlevel <- lapply(shared_seqlevels, function(seqlevel) {
     ## Load the sequence from the FaFile object.
     param <- GRanges(seqlevel, IRanges(1, seqlengths(fafile)[[seqlevel]]))
     seq <- scanFa(fafile, param=param)[[1L]]

     ## Keep only genomic ranges of transcripts on the current sequence.
     ex_by_tx2 <- ex_by_tx
     seqlevels(ex_by_tx2, force=TRUE) <- seqlevel

     ## We need to filter out transcripts made of exons that come from both
     ## strands because extractTranscripts() doesn't support them.
     nrun <- elementLengths(runLength(strand(ex_by_tx2)))
     ex_by_tx2 <- ex_by_tx2[which(nrun == 1)]

     ## Call extractTranscripts().
     exonStarts <- start(ex_by_tx2)
     exonEnds <- end(ex_by_tx2)
     strand <- unlist(runValue(strand(ex_by_tx2)))
     txseqs <- extractTranscripts(seq, exonStarts, exonEnds, strand)
     names(txseqs) <- names(ex_by_tx2)

   ## Put together all the transcript sequences.
   txseqs <- do.call("c", txseqs_per_seqlevel)


>> If you want to extract all the transcripts reported in Ensembl 73 (i.e.
>> not only the 195381 transcripts located on the 24 chromosomes but also
>> the 18913 transcripts located on the alt loci and unplaced contigs),
>> then you'll need to forge a BSGenome data package for GRCh37.p12. Look
>> at the BSgenomeForge vignette in the BSgenome software package for how
>> to do this. It looks like one extra difficulty with GRCh37.p12 (and
>> previous GRCh37 patches) is that there doesn't seem to be an easy way
>> to download the complete assembly from NCBI as FASTA file(s). Last
>> time I checked (was around GRCh37.p7), it seemed that only the diff
>> information was available in some kind of specialized file format so
>> my impression was that one had to download the diff and then use it
>> to "patch" the GRCh37 sequences in order to produce the GRCh37.p12
>> FASTA sequences. All this didn't seem completely straightforward to me
>> but I didn't really spend enough time to know.
>> Ideally, we should improve the BSgenome infrastructure so that
>> we can also follow that diff model i.e. it would be great if we were
> Or use AnnotationHub to retrieve the fasta and gtf files and smooth out
> the approach above? AnnotationHub is lagging a little as we try to
> improve the behind-the-scenes infrastructure, but the indexed FaFile
> (and gtf) for release 72 is available with
>    library(AnnotationHub)
>    hub = AnnotationHub()
>    fa =
> hub$ensembl.release.72.fasta.homo_sapiens.dna.Homo_sapiens.GRCh37.72.dna.toplevel.fa.rz
>    gtf =
> hub$ensembl.release.72.gtf.homo_sapiens.Homo_sapiens.GRCh37.72.gtf_0.0.1.RData
> Martin
>> able to forge light-weight BSgenome data packages that only contain
>> the diff against another BSgenome data package. For example
>> BSgenome.Hsapiens.GRCh37.p12 could be forged as a light-weight
>> package that only contains the diff against (and depends on)
>> BSgenome.Hsapiens.GRCh37, which would be a heavy-weight package
>> (i.e. it contains the full sequences). From a user point of view
>> there would be no difference between light-weight and heavy-weight,
>> that is, both would contain a BSgenome object that looks like it
>> contains a full assembly. But from a data management point of view
>> that would make a big difference as we would then be able to keep
>> up with the pace of GRCh37 patches by providing one BSgenome data
>> package for each of them at a low cost.
>> Cheers,
>> H.
>>> I see a few options here. One obvious one would be to stick with UCSC
>>> hg19 and
>>> use the UCSC ensGene table, but others in my working group are using
>>> ens73 so
>>> this is a suboptimal solution. Is there an updated BSGenome available
>>> for
>>> GRCh37.p12, or an easy way to forge one? Have others encountered this
>>> issue?
>>> Thanks!
>>> Tim Johnstone
>>>   -- output of sessionInfo():
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>> attached base packages:
>>>   [1] grDevices datasets  splines   tcltk     utils     parallel  stats
>>> graphics  methods
>>> [10] base
>>> other attached packages:
>>>   [1] BSgenome.Hsapiens.UCSC.hg19_1.3.19  BiocInstaller_1.12.0
>>>   [3] data.table_1.8.10                   Hmisc_3.12-2
>>>   [5] Formula_1.1-1                       survival_2.37-4
>>>   [7] plyr_1.8                            gdata_2.13.2
>>>   [9] ShortRead_1.20.0                    lattice_0.20-24
>>> [11] rtracklayer_1.22.0                  Rsamtools_1.14.1
>>> [13] BSgenome.Drerio.UCSC.danRer7_1.3.17 BSgenome_1.30.0
>>> [15] Biostrings_2.30.0                   lessR_2.9.7
>>> [17] GenomicFeatures_1.14.0              AnnotationDbi_1.24.0
>>> [19] Biobase_2.22.0                      GenomicRanges_1.14.3
>>> [21] XVector_0.2.0                       IRanges_1.20.4
>>> [23] BiocGenerics_0.8.0
>>> loaded via a namespace (and not attached):
>>>   [1] biomaRt_2.18.0      bitops_1.0-6        car_2.0-19
>>> cluster_1.14.4
>>>   [5] DBI_0.2-7           foreign_0.8-57      grid_3.0.2
>>> gtools_3.1.0
>>>   [9] hwriter_1.3         latticeExtra_0.6-26 leaps_2.9
>>> MASS_7.3-29
>>> [13] MBESS_3.3.3         nnet_7.3-7          RColorBrewer_1.0-5
>>> RCurl_1.95-4.1
>>> [17] rpart_4.1-3         RSQLite_0.11.4      stats4_3.0.2
>>> tools_3.0.2
>>> [21] XML_3.95-0.2        zlibbioc_1.8.0
>>> --
>>> Sent via the guest posting facility at bioconductor.org.

Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

More information about the Bioconductor mailing list