[BioC] fastest way to get a gene list having certain GO term

Wed Mar 7 18:19:25 CET 2012

On 3/6/12 5:24 PM, Duke wrote:
> On 3/6/12 3:47 PM, Duke wrote:
>> Hi folks,
>>
>> I need some statistics for a certain GO term (for example, "DNA 
>> binding"), and I wonder what is the fastest way to archive the latest 
>> list of genes having that specific GO term. There are now a number of 
>> GO packages and I would like to hear/learn your experience regarding 
>> various different packages.
>
> To archive the above task, I separated it as two processes:
>
> * Get all GO IDs having specific term ("DNA binding")
> * Then, get all the genes having the resulting GO IDs
>
> I think I got the numbers now:
>
> library("GO.db")
> library("org.Hs.eg.db")
>
> GOTerm2GOID = function(term){
>   GTL = eapply(GOTERM, function(x){grep(term, x at Term, value=TRUE)})
>   GID = sapply(GTL, length)
>   names(GTL[GID > 0])
> }
>
> length(unlist(sapply(GOTerm2GOID("DNA binding"), function(x) mget(x, 
> revmap(org.Hs.egGO), ifnotfound=NA))))
> 4265
>
> However, I am still stuck at how to get the gene symbols (Il22, Foxp3 
> for example) as well as RefSeq ID of the resulting gene list.

Thanks David for your suggestion. I had some quick readings on 
org.Hs.eg.db as you suggested and finally came up with:

library("GO.db")
library("org.Hs.eg.db")

GOTerm2GOID <- function(term){
   GTL = eapply(GOTERM, function(x){grep(term, x at Term, value=TRUE)})
   GID = sapply(GTL, length)
   names(GTL[GID > 0])
}

ENTREZ2GENES <- function(entrezList=NULL, geneList=NULL){
   geneVec = as.vector(unlist(sapply(entrezList, 
function(x){geneList[names(geneList)==x]})))
   geneVec = geneVec[order(geneVec)]
   geneVec = geneVec[!duplicated(geneVec)]
   geneVec
}

## get all gene entrez ids
d = as.data.frame(unlist(sapply(GOTerm2GOID("RNA binding"), function(x) 
mget(x, revmap(org.Hs.egGO), ifnotfound=NA))))
geneIDs = as.vector(d[!is.na(d[[1]]),])
geneIDs = geneIDs[order(geneIDs)]
geneIDs = geneIDs[!duplicated(geneIDs)]
length(geneIDs)

## get all refseq IDs
refseqDat <- org.Hs.egREFSEQ
mapped_genes <- mappedkeys(refseqDat)
refseqDatList <- as.list(refseqDat[mapped_genes])
length(ENTREZ2GENES(entrezList=geneIDs, geneList=refseqDatList))

## get all gene symbol
geneSymDat <- org.Hs.egSYMBOL
mapped_genes <- mappedkeys(geneSymDat)
geneSymDatList <- as.list(geneSymDat[mapped_genes])
length(ENTREZ2GENES(entrezList=geneIDs, geneList=geneSymDatList))

I think the code works fine, and I get what I need, but I still believe 
there should be a faster way to achieve the same thing. I would like to 
hear more from experienced users.

Thanks,

D.