[BioC] Whole genome amino acid composition package?

Hervé Pagès hpages at fhcrc.org
Mon Oct 21 20:53:45 CEST 2013

Hi Bernardo,

On 10/21/2013 03:41 AM, Bernardo Bello wrote:
> Hi there,
> I'm interested in a statistical tool to get bacterial amino acid
> composition and genetic code usage at genome level.
> I've looking in MESH terms database but I'm a bit lost.
> http://www.ncbi.nlm.nih.gov/pubmed?term=%28%22Genetic%20Code%22%5BMesh%5D%29%20AND%20%22Software%22%5BMesh%5D

Assuming you've access to the annotated DNA sequence for your bacteria,
you would need:

   (a) A way to load the bacterial genome in Bioconductor as a
       DNAString object.

       Some suggested tools:
         - readDNAStringSet() in Biostrings.
         - scanFa() in Rsamtools.
         - import.2bit() in rtracklayer.
         - The biomaRt package to retrieve the bacterial genome from
           an online database via a Mart service.
         - You might find if convenient to wrap your bacterial genomes
           into a BSgenome data package like we've done for E. coli
           (see the BSgenome.Ecoli.NCBI.20080805 package).

   (b) A way to load the gene/CDS coordinates as a GRanges or GRangesList
       object. You need to make sure those coordinates are relative to
       the exact same genome assembly you used in (a).

       Some suggested tools:
         - import.gff() and import.bed() in rtracklayer.
         - makeTranscriptDbFromGFF() in GenomicFeatures.
         - The biomaRt package to retrieve the gene coordinates from
           an online database via a Mart service. Actually, if you're
           going to use biomaRt, you can skip that step and retrieve
           directly the gene/CDS sequences. See (c).

   (c) A way to extract the gene/CDS sequences as a DNAStringSet object.
       One gotcha for this step is to make sure that the sequences are
       extracted from the right strand i.e. if a gene is on the minus
       strand and its sequence is extracted from the plus strand, then
       the sequence needs to be reverse complemented (with
       reverseComplement() from the Biostrings package).

       Some suggested tools:
         - extractAt() from the Biostrings package. Does NOT handle
           the strand for you (i.e. you need to reverse complement
           sequences for genes on the minus strand).
         - The biomaRt package to retrieve the gene/CDS sequences from
           an online database via a Mart service.
         - getSeq() from the BSgenome package (if the DNA sequences are
           stored in a BSgenome data package). It does handle the
           strand for you.
         - Just mentioning it FYI: for more complex organisms
           with introns and UTRs, extractTranscriptsFromGenome() in
           GenomicFeatures would be the tool to use to extract the
           transcript or CDS sequences from a BSgenome object and a
           TranscriptDb object. It handles the strand and removes the
           introns for you.

    (d) Once you have the gene/CDS sequences in a DNAStringSet object,
        you can use translate() (from Biostrings) to translate to
        proteins. Right now it uses the Standard Genetic Code to
        translate but it would be easy to add support for alternate
        genetic codes. The result of the translation is an AAStringSet
        object. You can also use extractAt() on the DNAStringSet object
        to extract the codon composition of the gene/CDS sequences as
        a DNAStringSetList object. A simple table(unlist()) on the
        DNAStringSetList object will give you condon usage.

I don't know if there is already a package in Bioconductor or on CRAN
that does the above for you but all the building blocks are here, and
it should not be too hard to put them together. If you decide to go that
route and need more help, please come back, hopefully this time with
more specific questions.


> Thanks for your help. All the best, Bernardo
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

More information about the Bioconductor mailing list