[BioC] GO with R and Bioconductor

Tue Mar 1 00:06:12 CET 2011

On 2/28/11 2:31 PM, Vincent Carey wrote:
> On Mon, Feb 28, 2011 at 2:07 PM, Duke<duke.lists at gmx.com>  wrote:
>> On 2/26/11 8:09 PM, Vincent Carey wrote:
>>> On Sat, Feb 26, 2011 at 6:40 PM, Duke<duke.lists at gmx.com>    wrote:
>>>> Hi Vincent,
>>>>
>>>> On 2/26/11 2:38 PM, Vincent Carey wrote:
>>>>> On Sat, Feb 26, 2011 at 12:00 PM,<duke.lists at gmx.com>      wrote:
>>>>>> Dear colleagues,
>>>>>>
>>>>>>   I used to download GO database from geneoncology.org and did some c++
>>>>>> coding to manipulate the data as I wished. Now I want to try my luck
>>>>>> with R
>>>>>> - Bioconductor. I have heard of tons of tools supporting GO such as
>>>>>> GO.db,
>>>>>> topGO, goseq, GOstats, biomart etc... and I have been reading their
>>>>>> description and examples, but honestly I am overhelmed and dont really
>>>>>> know
>>>>>> which package I should use to fulfill my task. So please advise me how
>>>>>> I can
>>>>>> do the following two simple tasks:
>>>>>>
>>>>>>   1. I have a list of genes (with gene names from UCSC such as Foxp3
>>>>>> etc...). How do I filter this list to get genes that have certain GO
>>>>>> term
>>>>>> such as transcription factor?
>>>>> since you said it was a simple task, consider the simple solution
>>>>> involving the "%annTo%" operator, which tells whether the symbols on
>>>>> the left have been annotated to the term on the right:
>>>>>
>>>>>> c("FOXP3", "BRCA2") %annTo% "mammary"
>>>>> FOXP3 BRCA2
>>>>> FALSE  TRUE
>>>>>> c("FOXP3", "BRCA2") %annTo% "transcription factor"
>>>>> FOXP3 BRCA2
>>>>>   TRUE FALSE
>>>>>
>>>>> you could use the named logical vectors generated in this way to
>>>>> perform the filtering you describe.  but see below.
>>>>>
>>>>>>   2. How do I know the capacity of the latest GO database on
>>>>>> bioconductor,
>>>>>> for example, how many genes available for mm9, and how many of them
>>>>>> have GO
>>>>>> term transcription factor?
>>>>> The "GO database" concerns the gene ontology, a structure of terms and
>>>>> relationships among them.  The association of GO terms to gene names
>>>>> for mouse is presented in various ways, but the most basic one is in
>>>>> the org.Mm.eg.db package.  With that, you could use
>>>>>
>>>>> org.Mm.eg()
>>>>>
>>>>> to find, among other statistics,
>>>>>
>>>>> org.Mm.egGO has 29984 mapped keys (of 63329 keys).  Your question
>>>>> concerning transcription factor mapping is not completely precise, and
>>>>> you might want to survey the family of GO terms to come up with a set
>>>>> of terms that meets your requirement.
>>>> You are right, I did not make it clear. What I wanted was exactly what
>>>> you
>>>> said: I need to get a list of genes that have a certain set of terms (for
>>>> example *transcription*) - which corresponds to a family of GO terms, not
>>>> just one GO term.
>>>>
>>>>>    Here's a
>>>>> demonstration of related queries:
>>>>>
>>>>>> get("GO:0003700", GOTERM)
>>>>> GOID: GO:0003700
>>>>> Term: sequence-specific DNA binding transcription factor activity
>>>>> Ontology: MF
>>>>> Definition: Interacting selectively and non-covalently with a specific
>>>>>      DNA sequence in order to modulate transcription. The transcription
>>>>>      factor may or may not also interact selectively with a protein or
>>>>>      macromolecular complex.
>>>>> Synonym: GO:0000130
>>>>> Secondary: GO:0000130
>>>>>> tfg = get("GO:0003700", revmap(org.Mm.egGO))
>>>>>> length(tfg)
>>>>> [1] 940
>>>> This works fine in case of one GO term (GO: 0003700). Is there a similar
>>>> function like getterm("transcription", revmap()) to get all the genes
>>>> that
>>>> their GO terms contain *transcription*?
>>> As far as I know there is no such function.  Solutions depend on the
>>> type of wild-card you are looking for.  If SQL LIKE operator is
>>> adequate
>>>
>>> termlikeToTags = function (liketok)
>>> {
>>>      require(GO.db)
>>>      gcon = GO_dbconn()
>>>      as.character(dbGetQuery(gcon, paste("select go_id from go_term
>>> where term like '%",
>>>          liketok, "%'", sep = ""))[[1]])
>>> }
>>>
>>> if you want to use full regular expressions
>>>
>>> reToTags = function (retok, ...) {
>>>      require(GO.db)
>>>      allt = dbGetQuery(GO_dbconn(), "select  go_id, term from go_term")
>>>      inds = grep(retok, as.character(allt[,"term"]), value=FALSE, ...)
>>> # will error if value set at call
>>>      if (length(inds)>0) return(as.character(allt[inds,"go_id"]))
>>>      stop("retok did not grep to any term")
>>> }
>>>
>>>> length(reToTags("transcription"))
>>> [1] 372
>>>> length(termlikeToTags("transcription"))
>>> [1] 372
>>>> length(reToTags("transcrip.*"))
>>> [1] 411
>>>> length(termlikeToTags("transcrip.*"))
>>> [1] 0
>>>
>> As far as I understand, these functions will list out all the GO Terms
>> containing "transcription" for example. In order to search for all of the
>> genes in the database that have these GO Terms, I suppose I will have to
>> loop
>>
>> tfg = get("GO:0003700", revmap(org.Mm.egGO))
>> length(tfg)
>>
>> for each of the term and then sum them up? Since your %annTo% function is to
>> check a certain gene list to see if any of them has the specified term,
>> would it be better to just run that function to the available genes in the
>> database? How do I list and use the gene list in the database?
> I'd suggest you read the vignettes at
> http://www.bioconductor.org/help/bioc-views/release/bioc/html/AnnotationDbi.html,
> particularly http://www.bioconductor.org/packages/2.7/bioc/vignettes/AnnotationDbi/inst/doc/AnnotationDbi.pdf
>
> The codes I provided are hints in the direction of solutions for the
> family of questions you pose.
>
>>>>> org.Mm.egGO is a mapping from mouse entrez gene ids to GO term tags.
>>>>> revmap reverses this mapping and takes a tag to the set of genes
>>>>> mapped to the tag by entrez.
>>>>>
>>>>> Now, to return to the first question -- it isn't simple and a lot of
>>>>> presuppositions have to be made explicit.  One of the most problematic
>>>>> is the commitment to use gene symbols.  If you don't read the docs
>>>>> about bioconductor annotation and R packages pertaining thereto, it's
>>>>> hard to make progress.
>>>> Can you elaborate a little bit more why using gene symbols is
>>>> problematic?
>>>> And if it is really a problem, what gene ID I should use to avoid that
>>>> problem?
>>> Entrez gene IDs are generally more specific.  If you haven't run into
>>> collisions and ambiguities with symbols by now, maybe you don't have
>>> to worry about it.
>>>
>> Well, our reference database is UCSC refFlat downloaded from UCSC genome
>> browser, and in that format there is only geneName (SYMBOL) and name
>> (refSEQ). There is no Entrez Gene ID in that file. We actually have a
>> problem of one same geneName (SYMBOL) has more than one isoforms (more than
>> one refSEQ), but in fact we dont have a good way of differentiate between
>> those (by mathematical methods). So we use either gene symbol or refSeq ID.
>>
> most of our annotation packages have mappings that employ refseq identifiers.
> the functions i provided can be readily altered to use refseq ids as
> keys, and if you
> introduce types for the identifier/term tokens, the functions can be
> implemented as methods
> that perform appropriately for inputs from different vocabularies.
> but you will have to
> learn how to use the mapping objects or the sqlite tables
> productively.  Package GSEABase is
> also a useful resource for working with collections of gene identifiers.
>

Thanks for both of your suggestion Vincent. I will definitely check them 
out. Your suggestions/hints do help me a lot :).

Bests,

D.