[BioC] GO with R and Bioconductor

Sat Feb 26 20:38:05 CET 2011

On Sat, Feb 26, 2011 at 12:00 PM,  <duke.lists at gmx.com> wrote:
> Dear colleagues,
>
>  I used to download GO database from geneoncology.org and did some c++ coding to manipulate the data as I wished. Now I want to try my luck with R - Bioconductor. I have heard of tons of tools supporting GO such as GO.db, topGO, goseq, GOstats, biomart etc... and I have been reading their description and examples, but honestly I am overhelmed and dont really know which package I should use to fulfill my task. So please advise me how I can do the following two simple tasks:
>
>  1. I have a list of genes (with gene names from UCSC such as Foxp3 etc...). How do I filter this list to get genes that have certain GO term such as transcription factor?

since you said it was a simple task, consider the simple solution
involving the "%annTo%" operator, which tells whether the symbols on
the left have been annotated to the term on the right:

> c("FOXP3", "BRCA2") %annTo% "mammary"
FOXP3 BRCA2
FALSE  TRUE
> c("FOXP3", "BRCA2") %annTo% "transcription factor"
FOXP3 BRCA2
 TRUE FALSE

you could use the named logical vectors generated in this way to
perform the filtering you describe.  but see below.

>
>  2. How do I know the capacity of the latest GO database on bioconductor, for example, how many genes available for mm9, and how many of them have GO term transcription factor?

The "GO database" concerns the gene ontology, a structure of terms and
relationships among them.  The association of GO terms to gene names
for mouse is presented in various ways, but the most basic one is in
the org.Mm.eg.db package.  With that, you could use

org.Mm.eg()

to find, among other statistics,

org.Mm.egGO has 29984 mapped keys (of 63329 keys).  Your question
concerning transcription factor mapping is not completely precise, and
you might want to survey the family of GO terms to come up with a set
of terms that meets your requirement.  Here's a
demonstration of related queries:

> get("GO:0003700", GOTERM)
GOID: GO:0003700
Term: sequence-specific DNA binding transcription factor activity
Ontology: MF
Definition: Interacting selectively and non-covalently with a specific
    DNA sequence in order to modulate transcription. The transcription
    factor may or may not also interact selectively with a protein or
    macromolecular complex.
Synonym: GO:0000130
Secondary: GO:0000130
> tfg = get("GO:0003700", revmap(org.Mm.egGO))
> length(tfg)
[1] 940

org.Mm.egGO is a mapping from mouse entrez gene ids to GO term tags.
revmap reverses this mapping and takes a tag to the set of genes
mapped to the tag by entrez.

Now, to return to the first question -- it isn't simple and a lot of
presuppositions have to be made explicit.  One of the most problematic
is the commitment to use gene symbols.  If you don't read the docs
about bioconductor annotation and R packages pertaining thereto, it's
hard to make progress.  My code defining "%annTo%" will follow -- it
is not particularly efficient or well-designed, but it shows the
components that have to be identified and used for this to work.  All
the annotation is based on SQLite tables distributed with Bioconductor
packages, with interfaces defined in DBI and RSQLite packages.  The
%annTo% operator is defined below.  More thoughtful solutions are
invited.

sym2terms = function (sym)
{
    if (length(sym)>1) stop("scalar symbol input required") # bad form
    require(GO.db)
    require(org.Hs.eg.db)  # generalize
    egs = sapply(sym, function(z) get(z, revmap(org.Hs.egSYMBOL)))
    gos = mget(egs, org.Hs.egGO)
    goids = sapply(gos, lapply, "[[", "GOID")
    sapply(lapply(goids, function(z) get(z, GOTERM)), "Term")
}

genesAnnotatedTo = function(syms, term="transcription factor",
   meth=agrep) {
   tms = lapply(syms, sym2terms)
   chk = sapply(tms, function(z) length(agrep(term, z))>0)
   names(chk) = syms
   chk
}

"%annTo%" = function(x,y) genesAnnotatedTo(x,y)

>
>  Off note, I have never tried any of those GO packages, so any advise/suggestion are welcome. Thank you so much in advance,
>
>  D.
>
>        [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>