[BioC] GO with R and Bioconductor

Vincent Carey stvjc at channing.harvard.edu
Mon Feb 28 20:31:31 CET 2011


On Mon, Feb 28, 2011 at 2:07 PM, Duke <duke.lists at gmx.com> wrote:
> On 2/26/11 8:09 PM, Vincent Carey wrote:
>>
>> On Sat, Feb 26, 2011 at 6:40 PM, Duke<duke.lists at gmx.com>  wrote:
>>>
>>> Hi Vincent,
>>>
>>> On 2/26/11 2:38 PM, Vincent Carey wrote:
>>>>
>>>> On Sat, Feb 26, 2011 at 12:00 PM,<duke.lists at gmx.com>    wrote:
>>>>>
>>>>> Dear colleagues,
>>>>>
>>>>>  I used to download GO database from geneoncology.org and did some c++
>>>>> coding to manipulate the data as I wished. Now I want to try my luck
>>>>> with R
>>>>> - Bioconductor. I have heard of tons of tools supporting GO such as
>>>>> GO.db,
>>>>> topGO, goseq, GOstats, biomart etc... and I have been reading their
>>>>> description and examples, but honestly I am overhelmed and dont really
>>>>> know
>>>>> which package I should use to fulfill my task. So please advise me how
>>>>> I can
>>>>> do the following two simple tasks:
>>>>>
>>>>>  1. I have a list of genes (with gene names from UCSC such as Foxp3
>>>>> etc...). How do I filter this list to get genes that have certain GO
>>>>> term
>>>>> such as transcription factor?
>>>>
>>>> since you said it was a simple task, consider the simple solution
>>>> involving the "%annTo%" operator, which tells whether the symbols on
>>>> the left have been annotated to the term on the right:
>>>>
>>>>> c("FOXP3", "BRCA2") %annTo% "mammary"
>>>>
>>>> FOXP3 BRCA2
>>>> FALSE  TRUE
>>>>>
>>>>> c("FOXP3", "BRCA2") %annTo% "transcription factor"
>>>>
>>>> FOXP3 BRCA2
>>>>  TRUE FALSE
>>>>
>>>> you could use the named logical vectors generated in this way to
>>>> perform the filtering you describe.  but see below.
>>>>
>>>>>  2. How do I know the capacity of the latest GO database on
>>>>> bioconductor,
>>>>> for example, how many genes available for mm9, and how many of them
>>>>> have GO
>>>>> term transcription factor?
>>>>
>>>> The "GO database" concerns the gene ontology, a structure of terms and
>>>> relationships among them.  The association of GO terms to gene names
>>>> for mouse is presented in various ways, but the most basic one is in
>>>> the org.Mm.eg.db package.  With that, you could use
>>>>
>>>> org.Mm.eg()
>>>>
>>>> to find, among other statistics,
>>>>
>>>> org.Mm.egGO has 29984 mapped keys (of 63329 keys).  Your question
>>>> concerning transcription factor mapping is not completely precise, and
>>>> you might want to survey the family of GO terms to come up with a set
>>>> of terms that meets your requirement.
>>>
>>> You are right, I did not make it clear. What I wanted was exactly what
>>> you
>>> said: I need to get a list of genes that have a certain set of terms (for
>>> example *transcription*) - which corresponds to a family of GO terms, not
>>> just one GO term.
>>>
>>>>   Here's a
>>>> demonstration of related queries:
>>>>
>>>>> get("GO:0003700", GOTERM)
>>>>
>>>> GOID: GO:0003700
>>>> Term: sequence-specific DNA binding transcription factor activity
>>>> Ontology: MF
>>>> Definition: Interacting selectively and non-covalently with a specific
>>>>     DNA sequence in order to modulate transcription. The transcription
>>>>     factor may or may not also interact selectively with a protein or
>>>>     macromolecular complex.
>>>> Synonym: GO:0000130
>>>> Secondary: GO:0000130
>>>>>
>>>>> tfg = get("GO:0003700", revmap(org.Mm.egGO))
>>>>> length(tfg)
>>>>
>>>> [1] 940
>>>
>>> This works fine in case of one GO term (GO: 0003700). Is there a similar
>>> function like getterm("transcription", revmap()) to get all the genes
>>> that
>>> their GO terms contain *transcription*?
>>
>> As far as I know there is no such function.  Solutions depend on the
>> type of wild-card you are looking for.  If SQL LIKE operator is
>> adequate
>>
>> termlikeToTags = function (liketok)
>> {
>>     require(GO.db)
>>     gcon = GO_dbconn()
>>     as.character(dbGetQuery(gcon, paste("select go_id from go_term
>> where term like '%",
>>         liketok, "%'", sep = ""))[[1]])
>> }
>>
>> if you want to use full regular expressions
>>
>> reToTags = function (retok, ...) {
>>     require(GO.db)
>>     allt = dbGetQuery(GO_dbconn(), "select  go_id, term from go_term")
>>     inds = grep(retok, as.character(allt[,"term"]), value=FALSE, ...)
>> # will error if value set at call
>>     if (length(inds)>0) return(as.character(allt[inds,"go_id"]))
>>     stop("retok did not grep to any term")
>> }
>>
>>> length(reToTags("transcription"))
>>
>> [1] 372
>>>
>>> length(termlikeToTags("transcription"))
>>
>> [1] 372
>>>
>>> length(reToTags("transcrip.*"))
>>
>> [1] 411
>>>
>>> length(termlikeToTags("transcrip.*"))
>>
>> [1] 0
>>
>
> As far as I understand, these functions will list out all the GO Terms
> containing "transcription" for example. In order to search for all of the
> genes in the database that have these GO Terms, I suppose I will have to
> loop
>
> tfg = get("GO:0003700", revmap(org.Mm.egGO))
> length(tfg)
>
> for each of the term and then sum them up? Since your %annTo% function is to
> check a certain gene list to see if any of them has the specified term,
> would it be better to just run that function to the available genes in the
> database? How do I list and use the gene list in the database?

I'd suggest you read the vignettes at
http://www.bioconductor.org/help/bioc-views/release/bioc/html/AnnotationDbi.html,
particularly http://www.bioconductor.org/packages/2.7/bioc/vignettes/AnnotationDbi/inst/doc/AnnotationDbi.pdf

The codes I provided are hints in the direction of solutions for the
family of questions you pose.

>
>>>> org.Mm.egGO is a mapping from mouse entrez gene ids to GO term tags.
>>>> revmap reverses this mapping and takes a tag to the set of genes
>>>> mapped to the tag by entrez.
>>>>
>>>> Now, to return to the first question -- it isn't simple and a lot of
>>>> presuppositions have to be made explicit.  One of the most problematic
>>>> is the commitment to use gene symbols.  If you don't read the docs
>>>> about bioconductor annotation and R packages pertaining thereto, it's
>>>> hard to make progress.
>>>
>>> Can you elaborate a little bit more why using gene symbols is
>>> problematic?
>>> And if it is really a problem, what gene ID I should use to avoid that
>>> problem?
>>
>> Entrez gene IDs are generally more specific.  If you haven't run into
>> collisions and ambiguities with symbols by now, maybe you don't have
>> to worry about it.
>>
>
> Well, our reference database is UCSC refFlat downloaded from UCSC genome
> browser, and in that format there is only geneName (SYMBOL) and name
> (refSEQ). There is no Entrez Gene ID in that file. We actually have a
> problem of one same geneName (SYMBOL) has more than one isoforms (more than
> one refSEQ), but in fact we dont have a good way of differentiate between
> those (by mathematical methods). So we use either gene symbol or refSeq ID.
>

most of our annotation packages have mappings that employ refseq identifiers.
the functions i provided can be readily altered to use refseq ids as
keys, and if you
introduce types for the identifier/term tokens, the functions can be
implemented as methods
that perform appropriately for inputs from different vocabularies.
but you will have to
learn how to use the mapping objects or the sqlite tables
productively.  Package GSEABase is
also a useful resource for working with collections of gene identifiers.

>>>>   My code defining "%annTo%" will follow -- it
>>>> is not particularly efficient or well-designed, but it shows the
>>>> components that have to be identified and used for this to work.  All
>>>> the annotation is based on SQLite tables distributed with Bioconductor
>>>> packages, with interfaces defined in DBI and RSQLite packages.  The
>>>> %annTo% operator is defined below.  More thoughtful solutions are
>>>> invited.
>>>>
>>>> sym2terms = function (sym)
>>>> {
>>>>     if (length(sym)>1) stop("scalar symbol input required") # bad form
>>>>     require(GO.db)
>>>>     require(org.Hs.eg.db)  # generalize
>>>>     egs = sapply(sym, function(z) get(z, revmap(org.Hs.egSYMBOL)))
>>>>     gos = mget(egs, org.Hs.egGO)
>>>>     goids = sapply(gos, lapply, "[[", "GOID")
>>>>     sapply(lapply(goids, function(z) get(z, GOTERM)), "Term")
>>>> }
>>>>
>>>> genesAnnotatedTo = function(syms, term="transcription factor",
>>>>    meth=agrep) {
>>>>    tms = lapply(syms, sym2terms)
>>>>    chk = sapply(tms, function(z) length(agrep(term, z))>0)
>>>>    names(chk) = syms
>>>>    chk
>>>> }
>>>>
>>>> "%annTo%" = function(x,y) genesAnnotatedTo(x,y)
>>>>
>>> Thanks for this %annTo% function. It works perfectly!
>>>
>>> Regards,
>>>
>>> D.
>>>
>
>



More information about the Bioconductor mailing list