[BioC] GO with R and Bioconductor

Vincent Carey stvjc at channing.harvard.edu
Sun Feb 27 02:09:05 CET 2011


On Sat, Feb 26, 2011 at 6:40 PM, Duke <duke.lists at gmx.com> wrote:
> Hi Vincent,
>
> On 2/26/11 2:38 PM, Vincent Carey wrote:
>>
>> On Sat, Feb 26, 2011 at 12:00 PM,<duke.lists at gmx.com>  wrote:
>>>
>>> Dear colleagues,
>>>
>>>  I used to download GO database from geneoncology.org and did some c++
>>> coding to manipulate the data as I wished. Now I want to try my luck with R
>>> - Bioconductor. I have heard of tons of tools supporting GO such as GO.db,
>>> topGO, goseq, GOstats, biomart etc... and I have been reading their
>>> description and examples, but honestly I am overhelmed and dont really know
>>> which package I should use to fulfill my task. So please advise me how I can
>>> do the following two simple tasks:
>>>
>>>  1. I have a list of genes (with gene names from UCSC such as Foxp3
>>> etc...). How do I filter this list to get genes that have certain GO term
>>> such as transcription factor?
>>
>> since you said it was a simple task, consider the simple solution
>> involving the "%annTo%" operator, which tells whether the symbols on
>> the left have been annotated to the term on the right:
>>
>>> c("FOXP3", "BRCA2") %annTo% "mammary"
>>
>> FOXP3 BRCA2
>> FALSE  TRUE
>>>
>>> c("FOXP3", "BRCA2") %annTo% "transcription factor"
>>
>> FOXP3 BRCA2
>>  TRUE FALSE
>>
>> you could use the named logical vectors generated in this way to
>> perform the filtering you describe.  but see below.
>>
>>>  2. How do I know the capacity of the latest GO database on bioconductor,
>>> for example, how many genes available for mm9, and how many of them have GO
>>> term transcription factor?
>>
>> The "GO database" concerns the gene ontology, a structure of terms and
>> relationships among them.  The association of GO terms to gene names
>> for mouse is presented in various ways, but the most basic one is in
>> the org.Mm.eg.db package.  With that, you could use
>>
>> org.Mm.eg()
>>
>> to find, among other statistics,
>>
>> org.Mm.egGO has 29984 mapped keys (of 63329 keys).  Your question
>> concerning transcription factor mapping is not completely precise, and
>> you might want to survey the family of GO terms to come up with a set
>> of terms that meets your requirement.
>
> You are right, I did not make it clear. What I wanted was exactly what you
> said: I need to get a list of genes that have a certain set of terms (for
> example *transcription*) - which corresponds to a family of GO terms, not
> just one GO term.
>
>>   Here's a
>> demonstration of related queries:
>>
>>> get("GO:0003700", GOTERM)
>>
>> GOID: GO:0003700
>> Term: sequence-specific DNA binding transcription factor activity
>> Ontology: MF
>> Definition: Interacting selectively and non-covalently with a specific
>>     DNA sequence in order to modulate transcription. The transcription
>>     factor may or may not also interact selectively with a protein or
>>     macromolecular complex.
>> Synonym: GO:0000130
>> Secondary: GO:0000130
>>>
>>> tfg = get("GO:0003700", revmap(org.Mm.egGO))
>>> length(tfg)
>>
>> [1] 940
>
> This works fine in case of one GO term (GO: 0003700). Is there a similar
> function like getterm("transcription", revmap()) to get all the genes that
> their GO terms contain *transcription*?

As far as I know there is no such function.  Solutions depend on the
type of wild-card you are looking for.  If SQL LIKE operator is
adequate

termlikeToTags = function (liketok)
{
    require(GO.db)
    gcon = GO_dbconn()
    as.character(dbGetQuery(gcon, paste("select go_id from go_term
where term like '%",
        liketok, "%'", sep = ""))[[1]])
}

if you want to use full regular expressions

reToTags = function (retok, ...) {
    require(GO.db)
    allt = dbGetQuery(GO_dbconn(), "select  go_id, term from go_term")
    inds = grep(retok, as.character(allt[,"term"]), value=FALSE, ...)
# will error if value set at call
    if (length(inds)>0) return(as.character(allt[inds,"go_id"]))
    stop("retok did not grep to any term")
}

> length(reToTags("transcription"))
[1] 372
> length(termlikeToTags("transcription"))
[1] 372
> length(reToTags("transcrip.*"))
[1] 411
> length(termlikeToTags("transcrip.*"))
[1] 0

>
>> org.Mm.egGO is a mapping from mouse entrez gene ids to GO term tags.
>> revmap reverses this mapping and takes a tag to the set of genes
>> mapped to the tag by entrez.
>>
>> Now, to return to the first question -- it isn't simple and a lot of
>> presuppositions have to be made explicit.  One of the most problematic
>> is the commitment to use gene symbols.  If you don't read the docs
>> about bioconductor annotation and R packages pertaining thereto, it's
>> hard to make progress.
>
> Can you elaborate a little bit more why using gene symbols is problematic?
> And if it is really a problem, what gene ID I should use to avoid that
> problem?

Entrez gene IDs are generally more specific.  If you haven't run into
collisions and ambiguities with symbols by now, maybe you don't have
to worry about it.

>
>>   My code defining "%annTo%" will follow -- it
>> is not particularly efficient or well-designed, but it shows the
>> components that have to be identified and used for this to work.  All
>> the annotation is based on SQLite tables distributed with Bioconductor
>> packages, with interfaces defined in DBI and RSQLite packages.  The
>> %annTo% operator is defined below.  More thoughtful solutions are
>> invited.
>>
>> sym2terms = function (sym)
>> {
>>     if (length(sym)>1) stop("scalar symbol input required") # bad form
>>     require(GO.db)
>>     require(org.Hs.eg.db)  # generalize
>>     egs = sapply(sym, function(z) get(z, revmap(org.Hs.egSYMBOL)))
>>     gos = mget(egs, org.Hs.egGO)
>>     goids = sapply(gos, lapply, "[[", "GOID")
>>     sapply(lapply(goids, function(z) get(z, GOTERM)), "Term")
>> }
>>
>> genesAnnotatedTo = function(syms, term="transcription factor",
>>    meth=agrep) {
>>    tms = lapply(syms, sym2terms)
>>    chk = sapply(tms, function(z) length(agrep(term, z))>0)
>>    names(chk) = syms
>>    chk
>> }
>>
>> "%annTo%" = function(x,y) genesAnnotatedTo(x,y)
>>
>
> Thanks for this %annTo% function. It works perfectly!
>
> Regards,
>
> D.
>



More information about the Bioconductor mailing list