[BioC] BSgenomes vs ENSEMBL

Sat Oct 18 21:48:28 CEST 2008

Thanks Herve and Sean for answering so quickly. However, before actually
starting there is already one thing I would like to know (from the
vignette of the BSgenome package):
[quote]
5 Masking the chromosome sequences
Starting with Bioconductor 2.2, some BSgenome data packages provide
built-in masks for the chromosome sequences. For example, each
chromosome in BSgenome.Hsapiens.UCSC.hg18 has 3 masks on it: the mask of
assembly gaps, the mask of repeat regions that were determined by the
RepeatMasker software, and the mask of repeat regions that were
determined by the Tandem Repeats Finder software (where only repeats
with period less than or equal to 12 were kept).
[/quote]

Therefore, when doing a genome-wide scan, is it best to use the UNMASKED
sequences (the default), or would enabling the masks give more
biologically relevant results? Again, I have a set of 18 exact
sequences of 15-17 bp (putative TF binding sites) whose locations I
would like to find in the mouse genome.
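
To make the question concrete, here is a minimal sketch of what I mean by
scanning with and without masks. It is plain Python, not the Biostrings
API: the function names, the toy "chromosome" and the mask coordinates are
all made up for illustration, and dropping every hit that overlaps a masked
interval is my assumption about what masking would do, not something I have
checked against the package.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_exact(pattern, subject):
    """Yield 0-based start positions of all exact matches, overlapping ones included."""
    start = subject.find(pattern)
    while start != -1:
        yield start
        start = subject.find(pattern, start + 1)


def scan(pattern, subject, masks=()):
    """Scan both strands and drop hits that overlap a masked interval.

    masks holds hypothetical half-open, 0-based (start, end) intervals that
    stand in for assembly gaps or repeat regions.
    """
    pattern = pattern.upper()
    subject = subject.upper()
    hits = []
    for strand, pat in (("+", pattern), ("-", reverse_complement(pattern))):
        for start in find_exact(pat, subject):
            end = start + len(pat)
            # Two half-open intervals overlap when each starts before the other ends.
            if any(start < m_end and m_start < end for m_start, m_end in masks):
                continue
            hits.append((start, end, strand))
    return sorted(hits)


site = "aaggggaaaaggtca"
# Toy sequence: one site on the plus strand, a short gap, one site on the minus strand.
toy_chromosome = "CCC" + site.upper() + "NNNNN" + reverse_complement(site) + "GG"

unmasked_hits = scan(site, toy_chromosome)
masked_hits = scan(site, toy_chromosome, masks=[(20, 40)])  # made-up "repeat" region
print(unmasked_hits)
print(masked_hits)
```

On this toy input the unmasked scan reports both sites, while the masked
scan keeps only the plus-strand one, because the minus-strand site lies
inside the made-up masked region. My question is which of the two
behaviours is the more sensible one for putative binding sites.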

Thanks,
Guido

> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch 
> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of 
> Herve Pages
> Sent: 18 October 2008 00:21
> To: Hooiveld, Guido
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] BSgenomes vs ENSEMBL
> 
> Hi Guido,
> 
> My understanding is that UCSC generally doesn't assemble a 
> genome itself but gets it from whoever assembled it, 
> such as NCBI in the case of hg18 (NCBI Build 36.1) or mm9 (NCBI 
> Build 37).
> See this table:
> 
>   http://genome.ucsc.edu/FAQ/FAQreleases#release1
> 
> The "RELEASE NAME" column tells you who assembled the genome. 
> As you can see, all the genomes provided by UCSC have been 
> assembled by someone else (except old genomes hg1 to hg8).
> If ENSEMBL claims that they use the "NCBI m37" assembly, one 
> might be confident that this means that they use the same 
> assembly as mm9 from UCSC. If this is the case, the 
> chromosome sequences should be strictly identical.
> 
> As for the annotations, yes, I would expect them to differ 
> between UCSC and ENSEMBL but someone more familiar with this 
> topic would need to confirm this.
> 
> So yes you could in principle (1) use 
> BSgenome.Mmusculus.UCSC.mm9 to find the locations of your 
> short sequences and then (2) annotate them with ENSEMBL 
> annotations. I don't know what would be the best way of doing 
> (2) though.
> 
> For (1) there are several options available in Biostrings 
> depending on the "size" of the problem (i.e. how many short 
> sequences you need to match/align, how big they are and how 
> big the reference genome is) and whether you want to do exact 
> matching, or allow some mismatches only or allow indels too.
> 
> See pairwiseAlignment() for finding the alignments of a small 
> number of short patterns against a small genome. It 
> implements a Smith-Waterman or Needleman-Wunsch algorithm so 
> replacements (aka mismatches) and indels are fully supported.
> 
> See matchPattern() for exact matching and inexact matching 
> (with a small number of mismatches only, no indels) of a 
> small number of short patterns against a small or big genome.
> 
> See matchPDict() for doing the same thing as matchPattern() 
> (with some restrictions though) but when you have a lot 
> (thousands or millions) of short patterns against a small or 
> big genome. (See this recent post on this list for some hints 
> on how to use matchPDict:
> https://stat.ethz.ch/pipermail/bioconductor/2008-October/024629.html
> )
> 
> Cheers,
> H.
> 
> 
> Hooiveld, Guido wrote:
> >  
> > Dear list,
> > I am a novice in genome builds and therefore have some
> > basic questions.
> >  
> > My ultimate goal is to identify the exact locations in the mouse 
> > genome of several 'fixed' sequences, e.g. how many times is this 
> > specific sequence "aaggggaaaaggtca", a putative transcription factor 
> > binding site, present in the mouse genome, and more importantly, which 
> > genes are closest to a match. After searching the archive I came to 
> > the conclusion that the packages Biostrings + BSgenome likely can do 
> > what I am after.
> > 
> > http://thread.gmane.org/gmane.science.biology.informatics.conductor/17471
> >  
> > I understand the mouse genome in BSgenome.Mmusculus.UCSC.mm9 is built 
> > from data made available by UCSC. I also noticed that the UCSC 
> > mm9 assembly is also known as NCBI Build 37. However, my co-worker 
> > always uses ENSEMBL to find info on genes, and apparently ENSEMBL 
> > also uses the same assembly (i.e. NCBI m37 mouse). Therefore:
> > - Am I correct that UCSC and ENSEMBL use the same, identical 
> > genome assembly?
> > - Thus only the annotation of the genome differs between UCSC and 
> > ENSEMBL?
> > - As a result, can I use BSgenome.Mmusculus.UCSC.mm9 to identify the 
> > genomic locations of a specific sequence, which I can then 
> > annotate using ENSEMBL to identify the gene(s) that are closest to a 
> > match? And what would be the best way of doing this? biomaRt?
> >  
> > Thanks,
> > Guido
> > 
> > ------------------------------------------------
> > Guido Hooiveld, PhD
> > Nutrition, Metabolism & Genomics Group Division of Human Nutrition 
> > Wageningen University Biotechnion, Bomenweg 2
> > NL-6703 HD Wageningen
> > the Netherlands
> > tel: (+)31 317 485788
> > fax: (+)31 317 483342 
> > internet:   http://nutrigene.4t.com
> > email:      guido.hooiveld at wur.nl 
> > 
> > 
> > 
> > 
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: 
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> 