[BioC] Bacterial genomics

Wed Aug 11 21:28:46 CEST 2010

Hi Mathieu,

Some tools in Bioconductor that you might find helpful for this:

   - We provide full genome sequences pre-packaged in so-called
     "BSgenome data packages". There is one for Ecoli and I guess
     it would be easy for you to make one using the two complete
     Lactobacillus reuterei genomes available at NCBI.
     Use available.genomes() (from the BSgenome software package)
     to get the list of genomes that we support. See the BSgenomeForge
     vignette in the same package for how to make your own "BSgenome
     data package".
     Note that working with this type of package makes things convenient
     but is not required. You can also load the genome sequences
     "by hand" directly in your session with the read.DNAStringSet()
     function from the Biostrings package.

   - Try to use matchPattern() from Biostrings with e.g. max.mismatch=50
     and with.indels=TRUE to align each contig to the reference genome.
     You only have 100 contigs and the reference genome is small so you
     should not run into performance issues. Increase the value of
     max.mismatch when a contig doesn't align or break the contig into
     small pieces and align each piece separately.
     If the reference genome is not a single sequence but is made of
     a lot of small sequences (like contigs), then you might want to
     try vmatchPattern() for a slightly faster way to loop over these
     small reference sequences.

   - To extract all the details for a given alignment (i.e. mismatches,
     indels, gaps, score, etc...), you can use pairwiseAlignment() from
     Biostrings. Make sure the 'subject' you pass to it is the region
     identified previously by matchPattern(), not the full genome,
     otherwise you might run into performance or memory usage issues.

Finally, you might get more/better advice by asking on the Bioc-sig-seq
mailing list <Bioc-sig-sequencing at r-project.org>.

Cheers,
H.

On 08/10/2010 07:45 AM, Mathieu Parent wrote:
> Hi,
>
> I am currently working on a project where we used 454 sequencing to get the
> complete genome of a bacterial strain, a Lactobacillus reuterei. The results
> came back as about 100 contigs, for a total of 1.8Mb over the total 1.8Mb of
> the genome.
>
> My question is to know if I could use bioconductor to run an alignement of
> the two complete published genomes from NCBI and then get an alignment of my
> contigs on this.
>
> The goal is to develop a strain specific qPCR primer that will cover a
> sequence unique for our genome.
>
> Two approaches.
> 1. I could blast the whole genome in small chunks, rank it in score and
> select the top100 or top50
> 2. I could align them, visualise it with a genome browser and find deletions
> or unique sections.
>
> I have experience with bioconductor for microarray analysis and an
> intermediate knowledge of R.
>
> Thanks for any advice you all might have on the approach as well as the
> packages I could possibly use to execute them.
>
> Best Regards,
> Mathieu
> McGill University
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319