[BioC] GO with R and Bioconductor

Mon Feb 28 20:07:30 CET 2011

On 2/26/11 8:09 PM, Vincent Carey wrote:
> On Sat, Feb 26, 2011 at 6:40 PM, Duke<duke.lists at gmx.com>  wrote:
>> Hi Vincent,
>>
>> On 2/26/11 2:38 PM, Vincent Carey wrote:
>>> On Sat, Feb 26, 2011 at 12:00 PM,<duke.lists at gmx.com>    wrote:
>>>> Dear colleagues,
>>>>
>>>>   I used to download GO database from geneoncology.org and did some c++
>>>> coding to manipulate the data as I wished. Now I want to try my luck with R
>>>> - Bioconductor. I have heard of tons of tools supporting GO such as GO.db,
>>>> topGO, goseq, GOstats, biomart etc... and I have been reading their
>>>> description and examples, but honestly I am overhelmed and dont really know
>>>> which package I should use to fulfill my task. So please advise me how I can
>>>> do the following two simple tasks:
>>>>
>>>>   1. I have a list of genes (with gene names from UCSC such as Foxp3
>>>> etc...). How do I filter this list to get genes that have certain GO term
>>>> such as transcription factor?
>>> since you said it was a simple task, consider the simple solution
>>> involving the "%annTo%" operator, which tells whether the symbols on
>>> the left have been annotated to the term on the right:
>>>
>>>> c("FOXP3", "BRCA2") %annTo% "mammary"
>>> FOXP3 BRCA2
>>> FALSE  TRUE
>>>> c("FOXP3", "BRCA2") %annTo% "transcription factor"
>>> FOXP3 BRCA2
>>>   TRUE FALSE
>>>
>>> you could use the named logical vectors generated in this way to
>>> perform the filtering you describe.  but see below.
>>>
>>>>   2. How do I know the capacity of the latest GO database on bioconductor,
>>>> for example, how many genes available for mm9, and how many of them have GO
>>>> term transcription factor?
>>> The "GO database" concerns the gene ontology, a structure of terms and
>>> relationships among them.  The association of GO terms to gene names
>>> for mouse is presented in various ways, but the most basic one is in
>>> the org.Mm.eg.db package.  With that, you could use
>>>
>>> org.Mm.eg()
>>>
>>> to find, among other statistics,
>>>
>>> org.Mm.egGO has 29984 mapped keys (of 63329 keys).  Your question
>>> concerning transcription factor mapping is not completely precise, and
>>> you might want to survey the family of GO terms to come up with a set
>>> of terms that meets your requirement.
>> You are right, I did not make it clear. What I wanted was exactly what you
>> said: I need to get a list of genes that have a certain set of terms (for
>> example *transcription*) - which corresponds to a family of GO terms, not
>> just one GO term.
>>
>>>    Here's a
>>> demonstration of related queries:
>>>
>>>> get("GO:0003700", GOTERM)
>>> GOID: GO:0003700
>>> Term: sequence-specific DNA binding transcription factor activity
>>> Ontology: MF
>>> Definition: Interacting selectively and non-covalently with a specific
>>>      DNA sequence in order to modulate transcription. The transcription
>>>      factor may or may not also interact selectively with a protein or
>>>      macromolecular complex.
>>> Synonym: GO:0000130
>>> Secondary: GO:0000130
>>>> tfg = get("GO:0003700", revmap(org.Mm.egGO))
>>>> length(tfg)
>>> [1] 940
>> This works fine in case of one GO term (GO: 0003700). Is there a similar
>> function like getterm("transcription", revmap()) to get all the genes that
>> their GO terms contain *transcription*?
> As far as I know there is no such function.  Solutions depend on the
> type of wild-card you are looking for.  If SQL LIKE operator is
> adequate
>
> termlikeToTags = function (liketok)
> {
>      require(GO.db)
>      gcon = GO_dbconn()
>      as.character(dbGetQuery(gcon, paste("select go_id from go_term
> where term like '%",
>          liketok, "%'", sep = ""))[[1]])
> }
>
> if you want to use full regular expressions
>
> reToTags = function (retok, ...) {
>      require(GO.db)
>      allt = dbGetQuery(GO_dbconn(), "select  go_id, term from go_term")
>      inds = grep(retok, as.character(allt[,"term"]), value=FALSE, ...)
> # will error if value set at call
>      if (length(inds)>0) return(as.character(allt[inds,"go_id"]))
>      stop("retok did not grep to any term")
> }
>
>> length(reToTags("transcription"))
> [1] 372
>> length(termlikeToTags("transcription"))
> [1] 372
>> length(reToTags("transcrip.*"))
> [1] 411
>> length(termlikeToTags("transcrip.*"))
> [1] 0
>

As far as I understand, these functions will list out all the GO Terms 
containing "transcription" for example. In order to search for all of 
the genes in the database that have these GO Terms, I suppose I will 
have to loop

tfg = get("GO:0003700", revmap(org.Mm.egGO))
length(tfg)

for each of the term and then sum them up? Since your %annTo% function 
is to check a certain gene list to see if any of them has the specified 
term, would it be better to just run that function to the available 
genes in the database? How do I list and use the gene list in the database?

>>> org.Mm.egGO is a mapping from mouse entrez gene ids to GO term tags.
>>> revmap reverses this mapping and takes a tag to the set of genes
>>> mapped to the tag by entrez.
>>>
>>> Now, to return to the first question -- it isn't simple and a lot of
>>> presuppositions have to be made explicit.  One of the most problematic
>>> is the commitment to use gene symbols.  If you don't read the docs
>>> about bioconductor annotation and R packages pertaining thereto, it's
>>> hard to make progress.
>> Can you elaborate a little bit more why using gene symbols is problematic?
>> And if it is really a problem, what gene ID I should use to avoid that
>> problem?
> Entrez gene IDs are generally more specific.  If you haven't run into
> collisions and ambiguities with symbols by now, maybe you don't have
> to worry about it.
>

Well, our reference database is UCSC refFlat downloaded from UCSC genome 
browser, and in that format there is only geneName (SYMBOL) and name 
(refSEQ). There is no Entrez Gene ID in that file. We actually have a 
problem of one same geneName (SYMBOL) has more than one isoforms (more 
than one refSEQ), but in fact we dont have a good way of differentiate 
between those (by mathematical methods). So we use either gene symbol or 
refSeq ID.

>>>    My code defining "%annTo%" will follow -- it
>>> is not particularly efficient or well-designed, but it shows the
>>> components that have to be identified and used for this to work.  All
>>> the annotation is based on SQLite tables distributed with Bioconductor
>>> packages, with interfaces defined in DBI and RSQLite packages.  The
>>> %annTo% operator is defined below.  More thoughtful solutions are
>>> invited.
>>>
>>> sym2terms = function (sym)
>>> {
>>>      if (length(sym)>1) stop("scalar symbol input required") # bad form
>>>      require(GO.db)
>>>      require(org.Hs.eg.db)  # generalize
>>>      egs = sapply(sym, function(z) get(z, revmap(org.Hs.egSYMBOL)))
>>>      gos = mget(egs, org.Hs.egGO)
>>>      goids = sapply(gos, lapply, "[[", "GOID")
>>>      sapply(lapply(goids, function(z) get(z, GOTERM)), "Term")
>>> }
>>>
>>> genesAnnotatedTo = function(syms, term="transcription factor",
>>>     meth=agrep) {
>>>     tms = lapply(syms, sym2terms)
>>>     chk = sapply(tms, function(z) length(agrep(term, z))>0)
>>>     names(chk) = syms
>>>     chk
>>> }
>>>
>>> "%annTo%" = function(x,y) genesAnnotatedTo(x,y)
>>>
>> Thanks for this %annTo% function. It works perfectly!
>>
>> Regards,
>>
>> D.
>>