[BioC] GO with R and Bioconductor

Sun Feb 27 00:40:28 CET 2011

Hi Vincent,

On 2/26/11 2:38 PM, Vincent Carey wrote:
> On Sat, Feb 26, 2011 at 12:00 PM,<duke.lists at gmx.com>  wrote:
>> Dear colleagues,
>>
>>   I used to download GO database from geneoncology.org and did some c++ coding to manipulate the data as I wished. Now I want to try my luck with R - Bioconductor. I have heard of tons of tools supporting GO such as GO.db, topGO, goseq, GOstats, biomart etc... and I have been reading their description and examples, but honestly I am overhelmed and dont really know which package I should use to fulfill my task. So please advise me how I can do the following two simple tasks:
>>
>>   1. I have a list of genes (with gene names from UCSC such as Foxp3 etc...). How do I filter this list to get genes that have certain GO term such as transcription factor?
> since you said it was a simple task, consider the simple solution
> involving the "%annTo%" operator, which tells whether the symbols on
> the left have been annotated to the term on the right:
>
>> c("FOXP3", "BRCA2") %annTo% "mammary"
> FOXP3 BRCA2
> FALSE  TRUE
>> c("FOXP3", "BRCA2") %annTo% "transcription factor"
> FOXP3 BRCA2
>   TRUE FALSE
>
> you could use the named logical vectors generated in this way to
> perform the filtering you describe.  but see below.
>
>>   2. How do I know the capacity of the latest GO database on bioconductor, for example, how many genes available for mm9, and how many of them have GO term transcription factor?
> The "GO database" concerns the gene ontology, a structure of terms and
> relationships among them.  The association of GO terms to gene names
> for mouse is presented in various ways, but the most basic one is in
> the org.Mm.eg.db package.  With that, you could use
>
> org.Mm.eg()
>
> to find, among other statistics,
>
> org.Mm.egGO has 29984 mapped keys (of 63329 keys).  Your question
> concerning transcription factor mapping is not completely precise, and
> you might want to survey the family of GO terms to come up with a set
> of terms that meets your requirement.

You are right, I did not make it clear. What I wanted was exactly what 
you said: I need to get a list of genes that have a certain set of terms 
(for example *transcription*) - which corresponds to a family of GO 
terms, not just one GO term.

>    Here's a
> demonstration of related queries:
>
>> get("GO:0003700", GOTERM)
> GOID: GO:0003700
> Term: sequence-specific DNA binding transcription factor activity
> Ontology: MF
> Definition: Interacting selectively and non-covalently with a specific
>      DNA sequence in order to modulate transcription. The transcription
>      factor may or may not also interact selectively with a protein or
>      macromolecular complex.
> Synonym: GO:0000130
> Secondary: GO:0000130
>> tfg = get("GO:0003700", revmap(org.Mm.egGO))
>> length(tfg)
> [1] 940

This works fine in case of one GO term (GO: 0003700). Is there a similar 
function like getterm("transcription", revmap()) to get all the genes 
that their GO terms contain *transcription*?

> org.Mm.egGO is a mapping from mouse entrez gene ids to GO term tags.
> revmap reverses this mapping and takes a tag to the set of genes
> mapped to the tag by entrez.
>
> Now, to return to the first question -- it isn't simple and a lot of
> presuppositions have to be made explicit.  One of the most problematic
> is the commitment to use gene symbols.  If you don't read the docs
> about bioconductor annotation and R packages pertaining thereto, it's
> hard to make progress.

Can you elaborate a little bit more why using gene symbols is 
problematic? And if it is really a problem, what gene ID I should use to 
avoid that problem?

>    My code defining "%annTo%" will follow -- it
> is not particularly efficient or well-designed, but it shows the
> components that have to be identified and used for this to work.  All
> the annotation is based on SQLite tables distributed with Bioconductor
> packages, with interfaces defined in DBI and RSQLite packages.  The
> %annTo% operator is defined below.  More thoughtful solutions are
> invited.
>
> sym2terms = function (sym)
> {
>      if (length(sym)>1) stop("scalar symbol input required") # bad form
>      require(GO.db)
>      require(org.Hs.eg.db)  # generalize
>      egs = sapply(sym, function(z) get(z, revmap(org.Hs.egSYMBOL)))
>      gos = mget(egs, org.Hs.egGO)
>      goids = sapply(gos, lapply, "[[", "GOID")
>      sapply(lapply(goids, function(z) get(z, GOTERM)), "Term")
> }
>
> genesAnnotatedTo = function(syms, term="transcription factor",
>     meth=agrep) {
>     tms = lapply(syms, sym2terms)
>     chk = sapply(tms, function(z) length(agrep(term, z))>0)
>     names(chk) = syms
>     chk
> }
>
> "%annTo%" = function(x,y) genesAnnotatedTo(x,y)
>

Thanks for this %annTo% function. It works perfectly!

Regards,

D.