[BioC] BSgenomes vs ENSEMBL
hpages at fhcrc.org
Sat Oct 18 00:20:48 CEST 2008
My understanding is that UCSC generally doesn't assemble a genome
itself but gets it from whoever assembled it, such as NCBI in the
case of hg18 (NCBI Build 36.1) or mm9 (NCBI Build 37).
See this table:
The "RELEASE NAME" column tells you who assembled the genome. As you
can see, all the genomes provided by UCSC have been assembled by
someone else (except old genomes hg1 to hg8).
If ENSEMBL says that it uses the "NCBI m37" assembly, one can be
reasonably confident that this is the same assembly as mm9 from
UCSC. If this is the case, the chromosome sequences should be strictly
identical.
As for the annotations, yes, I would expect them to differ between UCSC
and ENSEMBL but someone more familiar with this topic would need to
confirm.
So yes you could in principle (1) use BSgenome.Mmusculus.UCSC.mm9 to
find the locations of your short sequences and then (2) annotate them
with ENSEMBL annotations. I don't know what would be the best way of
doing (2) though.
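
One possible way of doing (2), sketched under loud assumptions: the match locations and the gene table below are invented toy data (in practice the gene coordinates would have to be retrieved from ENSEMBL, e.g. with biomaRt, which is not shown here), and GenomicRanges is only one of several packages that could do the nearest-gene lookup. The one point worth illustrating is that UCSC names its chromosomes "chr1", "chr2", ... while ENSEMBL uses "1", "2", ..., so the names need to be reconciled before comparing coordinates.

```r
library(GenomicRanges)

# Hypothetical match locations found on the UCSC mm9 sequences in step (1).
hits <- GRanges(seqnames = c("chr1", "chr2"),
                ranges = IRanges(start = c(1000, 5000), width = 15))

# Hypothetical gene table in ENSEMBL style (chromosomes named "1", "2", ...).
genes <- GRanges(seqnames = c("1", "1", "2"),
                 ranges = IRanges(start = c(100, 3000, 4000),
                                  end = c(500, 3500, 4500)),
                 strand = c("+", "-", "+"),
                 gene_id = c("geneA", "geneB", "geneC"))

# Drop the UCSC "chr" prefix so that both objects use the same names
# (the mitochondrial chromosome, chrM vs MT, would need separate handling).
seqlevels(hits) <- sub("^chr", "", seqlevels(hits))

# For each match, the index of the closest gene on the same chromosome.
nearest_idx <- nearest(hits, genes)
hits$nearest_gene <- genes$gene_id[nearest_idx]
hits
```

This only works if the coordinates on both sides refer to the same assembly, which is the whole point of checking that mm9 and NCBI m37 are the same thing.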
For (1) there are several options available in Biostrings depending on
the "size" of the problem (i.e. how many short sequences you need to
match/align, how big they are and how big the reference genome is) and
whether you want to do exact matching, allow a few mismatches only,
or allow indels too.
See pairwiseAlignment() for finding the alignments of a small number of
short patterns against a small genome. It implements a Smith-Waterman or
Needleman-Wunsch algorithm so replacements (aka mismatches) and indels
are fully supported.
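
A minimal sketch of a local (Smith-Waterman style) alignment with pairwiseAlignment(). The subject is a toy sequence invented for illustration (the binding site from the original question with one A deleted), and the scoring scheme (match = 1, mismatch = -1, gap costs of 1) is an arbitrary choice, not a recommendation. As far as I know, in recent Bioconductor releases the pairwise alignment code lives in a separate package called pwalign, so the sketch attaches it when it is installed.

```r
library(Biostrings)
# Assumption: newer Bioconductor releases provide pairwiseAlignment() and
# nucleotideSubstitutionMatrix() in 'pwalign'; attach it if it is available.
if (requireNamespace("pwalign", quietly = TRUE)) library(pwalign)

site <- DNAString("AAGGGGAAAAGGTCA")
# Toy subject: the site with one A removed from its run of four, plus flanks.
subject_seq <- DNAString("TTAAGGGGAAAGGTCACC")

# Arbitrary illustrative scoring scheme.
scores <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1,
                                       baseOnly = TRUE)

aln <- pairwiseAlignment(site, subject_seq, type = "local",
                         substitutionMatrix = scores,
                         gapOpening = 1, gapExtension = 1)

aln          # shows the aligned pattern and subject, including any gap
score(aln)   # alignment score under the scheme above
```

Using type = "global" instead would give a Needleman-Wunsch style alignment over the full length of both sequences.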
See matchPattern() for exact matching and inexact matching (with a small
number of mismatches only, no indels) of a small number of short patterns
against a small or big genome.
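
A minimal sketch of matchPattern() on a toy sequence made up for illustration. With the real genome the subject would be a chromosome taken from BSgenome.Mmusculus.UCSC.mm9 (e.g. Mmusculus$chr1) and the search would be repeated for each chromosome. Since a binding site can sit on either strand, the sketch also searches for the reverse complement.

```r
library(Biostrings)

# Toy subject standing in for one chromosome: the site on the plus strand,
# a short spacer, then the reverse complement of the site.
chrom <- DNAString("TTAAGGGGAAAAGGTCACCCTGACCTTTTCCCCTTGG")
site <- DNAString("AAGGGGAAAAGGTCA")

# Exact matches on the plus strand.
plus_hits <- matchPattern(site, chrom)

# Exact matches on the minus strand, reported in plus strand coordinates.
minus_hits <- matchPattern(reverseComplement(site), chrom)

# Inexact matching: allow at most 1 mismatch, no indels.
fuzzy_hits <- matchPattern(site, chrom, max.mismatch = 1)

start(plus_hits)
start(minus_hits)
length(fuzzy_hits)
```

countPattern() takes the same arguments and returns only the number of matches, which is enough if you just want to know how many times the site occurs.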
See matchPDict() for doing the same thing as matchPattern() (with some
restrictions though) but when you have a lot (thousands or millions) of
short patterns against a small or big genome. A recent post on this
list has some hints on how to use matchPDict.
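
A minimal sketch of the matchPDict() workflow, again on an invented toy sequence and with three made-up pattern names; it is not the content of the post mentioned above. One of the restrictions, as far as I know, is that by default all the patterns in the dictionary must have the same length, which is the case here.

```r
library(Biostrings)

# Toy subject standing in for one chromosome.
chrom <- DNAString("TTAAGGGGAAAAGGTCACCCTGACCTTTTCCCCTTGG")

# A small set of patterns, all 15 nucleotides long (hypothetical names).
sites <- DNAStringSet(c(siteA = "AAGGGGAAAAGGTCA",
                        siteB = "TGACCTTTTCCCCTT",
                        siteC = "ACGTACGTACGTACG"))

# Preprocess the patterns once; the dictionary can then be reused
# against every chromosome.
pdict <- PDict(sites)

hits <- matchPDict(pdict, chrom)     # match positions for each pattern
counts <- countPDict(pdict, chrom)   # number of matches for each pattern

startIndex(hits)   # list with one vector of start positions per pattern
counts
```

The preprocessing step is what makes this approach pay off when the number of patterns is large; for a handful of patterns matchPattern() in a loop is simpler.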
Hooiveld, Guido wrote:
> Dear list,
> I am a novice in genome builds and have therefore some basic questions.
> My ultimate goal is to identify the exact locations in the mouse genome
> of several 'fixed' sequences, e.g. how many times is this specific
> sequence "aaggggaaaaggtca", a putative transcription factor binding
> site, present in the mouse genome, and more importantly, which genes are
> closest to a match. After searching the archive I came to the conclusion
> that the libraries Biostrings + BSgenome likely can do what I am after.
> I understand the mouse genome in BSgenome.Mmusculus.UCSC.mm9 is built
> based on data made available by the UCSC. I also noticed that the UCSC
> MM9 assembly is also known as NCBI Build 37. However, my co-worker
> always uses ENSEMBL to find info on genes...., but apparently ENSEMBL
> also uses the same assembly (i.e. NCBI m37 mouse). Therefore:
> - Am I correct; in other words, do UCSC and ENSEMBL use the same, identical
> genome assembly?
> - Thus only the annotation of the genome differs between UCSC and ENSEMBL?
> - As a result, I can use the Bs.genome.xxx.mm9 to identify the locations
> at the genome of a specific sequence, which I then can annotate using
> ENSEMBL to identify the gene(s) that are closest to a match? And what
> would be the best way of doing this? biomaRt?
> Guido Hooiveld, PhD
> Nutrition, Metabolism & Genomics Group
> Division of Human Nutrition
> Wageningen University
> Biotechnion, Bomenweg 2
> NL-6703 HD Wageningen
> the Netherlands
> tel: (+)31 317 485788
> fax: (+)31 317 483342
> internet: http://nutrigene.4t.com <http://nutrigene.4t.com/>
> email: guido.hooiveld at wur.nl