[BioC] BSgenomes vs ENSEMBL

Sat Oct 18 00:04:40 CEST 2008

On Fri, Oct 17, 2008 at 5:16 PM, Hooiveld, Guido <Guido.Hooiveld at wur.nl> wrote:
>
> Dear list,
> I am a novice with genome builds and therefore have some basic questions.
>
> My ultimate goal is to identify the exact locations in the mouse genome
> of several 'fixed' sequences, e.g. how many times is this specific
> sequence "aaggggaaaaggtca", a putative transcription factor binding
> site, present in the mouse genome, and more importantly, which genes are
> closest to a match. After searching the archive I came to the conclusion
> that the libraries Biostrings + BSgenome likely can do what I am after.
> http://thread.gmane.org/gmane.science.biology.informatics.conductor/17471
>
> I understand the mouse genome in BSgenome.Mmusculus.UCSC.mm9 is built
> from data made available by UCSC. I also noticed that the UCSC
> mm9 assembly is also known as NCBI Build 37. However, my co-worker
> always uses ENSEMBL to find info on genes, but apparently ENSEMBL
> also uses the same assembly (i.e. NCBI m37 mouse). Therefore:
> - Am I correct; in other words, do UCSC and ENSEMBL use the same, identical
> genome assembly?
> - Thus only the annotation of the genome differs between UCSC and
> ENSEMBL?
> - As a result, I can use BSgenome.Mmusculus.UCSC.mm9 to identify the
> locations in the genome of a specific sequence, which I can then annotate
> using ENSEMBL to identify the gene(s) that are closest to a match? And what
> would be the best way of doing this? biomaRt?

Everything you said above is correct.  And biomaRt would be a good choice.
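
As a rough illustration of the two steps in the question (find every exact
occurrence of the site on both strands, then pick the closest gene), here is
a minimal sketch in plain Python on toy data. It is not the Biostrings or
biomaRt API; in practice matchPattern() from Biostrings would do the search on
a BSgenome chromosome, and biomaRt would supply the gene coordinates. The
function names, the toy chromosome and the three toy genes are all made up
for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (upper case)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_exact_matches(chrom_seq, motif):
    """Return sorted (start, strand) pairs, 1-based, for every exact
    occurrence of motif on either strand (overlapping hits included)."""
    chrom_seq = chrom_seq.upper()
    hits = []
    patterns = (("+", motif.upper()), ("-", reverse_complement(motif)))
    for strand, pattern in patterns:
        pos = chrom_seq.find(pattern)
        while pos != -1:
            hits.append((pos + 1, strand))           # convert to 1-based
            pos = chrom_seq.find(pattern, pos + 1)   # allow overlaps
    return sorted(hits)


def nearest_gene(position, genes):
    """Return the name of the gene closest to position.
    genes: list of (name, start, end), 1-based inclusive, same chromosome."""
    def distance(gene):
        _, start, end = gene
        if start <= position <= end:
            return 0                                  # hit lies inside the gene
        return min(abs(position - start), abs(position - end))
    return min(genes, key=distance)[0]


# Toy data (hypothetical): one forward-strand and one reverse-strand site.
motif = "aaggggaaaaggtca"
toy_chrom = "TTTT" + "AAGGGGAAAAGGTCA" + "CCCCCC" + "TGACCTTTTCCCCTT" + "GG"
toy_genes = [("geneA", 1, 3), ("geneB", 30, 60), ("geneC", 100, 200)]

for start, strand in find_exact_matches(toy_chrom, motif):
    print(start, strand, nearest_gene(start, toy_genes))
```

On the toy chromosome this reports one hit per strand. Note that a palindromic
site would be reported at the same position on both strands, so such hits may
need deduplicating. When combining the two resources for real, also keep in
mind that UCSC names chromosomes "chr1", "chr2", ... while Ensembl uses "1",
"2", ...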

Sean