[BioC] GOstats suggestion

Mon Jun 4 18:20:00 CEST 2007

Hi Johannes,

You are right that the current Category/GOstats implementations rely
on Bioconductor annotation data packages being available.  Taking the
time to generate an annotation data package using AnnBuilder would
have other benefits aside from being able to use the GOstats code, but
I can sympathize with wanting a way to use these tools without going
through that step first.

I'm not opposed to the idea of finding a way to let the GOstats tools
operate without an annotation data package, but at present I won't have
time to implement anything (what is there now suits our needs fairly
well).  So patches are welcome. :-)

"Johannes Rainer" <johannes.rainer at tcri.at> writes:
> Thanks for your suggestion; this would be a solution. But as far as I
> understand, each time the hyperGTest function is called, the functions
> from the GOstats and Category packages map the submitted IDs to GO terms
> using the annotation packages (e.g. the hgu133plus2 annotation package).
> The mapping is actually performed in the getGoToEntrezMap function
> (Category package), which maps EntrezGene IDs to GO terms by first
> mapping Affy IDs to GO terms and then Affy IDs to EntrezGene IDs.

Yes, the mapping is recomputed for each call and this could probably
be improved.  Indeed, as we transition to SQLite-based annotation data
packages, many of the contortions of the current code can be avoided
entirely.  I'm not sure we can avoid computing the mapping for each
call because we need to filter the mapping based on the provided list
of gene IDs.
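
As a rough sketch of that trade-off (toy data, base R only; `fullMap`,
`filterGoMap`, and all IDs below are made up for illustration and are not
part of Category or GOstats), the full GO-to-Entrez list could be built
once, leaving only the filtering step to run on each call:

```r
# Toy GO -> EntrezGene map standing in for one built once from annotation
# data.  All IDs here are invented for illustration.
fullMap <- list(
    "GO:0000001" = c("10", "20", "30"),
    "GO:0000002" = c("40"),
    "GO:0000003" = c("20", "50")
)

# Hypothetical helper: restrict the map to the supplied gene IDs and drop
# GO terms left with no genes.  This part still runs on every call.
filterGoMap <- function(goMap, geneIds) {
    kept <- lapply(goMap, function(ids) ids[ids %in% geneIds])
    kept[vapply(kept, length, integer(1)) > 0L]
}

universe <- c("10", "20", "40")
selected <- c("20")

universeMap <- filterGoMap(fullMap, universe)      # filter to the universe
selectedMap <- filterGoMap(universeMap, selected)  # then to the selection
```

Whether caching the unfiltered map pays off would depend on how expensive
the filtering is relative to building the map in the first place.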

> When I submit the EntrezGene IDs of the selected genes and those of the
> gene universe, I would not need the information from the annotation
> packages that map Affy IDs to EntrezGene IDs and Affy IDs to GO terms.
> The mapping between GO terms and EntrezGene IDs can be performed using
> the GO package, i.e.
>
>     # x: EntrezGene IDs, either from the gene universe or the selected ones
>     GOLL <- as.list(get("GOALLENTREZID", mode="environment"))
>     # remove all the GO ids that are not mapped to any EntrezGene ID
>     GOLL <- GOLL[!is.na(GOLL)]
>     PresentGO <- sapply(GOLL, function(z) {
>         # check length first so the condition is always a single value
>         if (length(z) == 0 || all(is.na(z)))
>             return(FALSE)
>         any(x %in% z)
>     })
>
>     GOLL <- GOLL[PresentGO]
>
> GOLL is then a list of all GO terms for the EntrezGene IDs specified with x
> (containing all ontologies: MF, CC and BP)

Aside:

  The GOALLENTREZID map should probably be replaced with organism-
  and ontology-specific maps.  The current map is huge, and if we were
  to use it as you are suggesting, I suspect it would be even slower
  than the current map generation to go through and select the
  desired ontology, eliminate GO IDs with no annotations in the
  selected gene list, etc.
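
  To illustrate what an ontology-specific map might buy (a sketch on
  invented data; `goToEntrez`, `goOntology`, and every ID below are
  hypothetical and not taken from the GO package), the combined list
  could be split by ontology once, so each call scans only the terms
  of the ontology it needs:

```r
# Toy data; every ID and ontology assignment below is invented.
goToEntrez <- list(
    "GO:0000001" = c("10", "20"),
    "GO:0000002" = c("30"),
    "GO:0000003" = c("20", "40"),
    "GO:0000004" = c("50")
)
goOntology <- c("GO:0000001" = "BP", "GO:0000002" = "MF",
                "GO:0000003" = "BP", "GO:0000004" = "CC")

# One-time step: split the combined map into one list per ontology.
byOntology <- split(goToEntrez, goOntology[names(goToEntrez)])

# Per-call step: scan only the requested ontology's terms and keep
# those annotated to at least one of the supplied EntrezGene IDs.
x <- c("40", "50")
bp <- byOntology[["BP"]]
bpPresent <- bp[vapply(bp, function(z) any(x %in% z), logical(1))]
```

  Here `bpPresent` keeps only the BP terms annotated to a gene in `x`;
  the MF and CC terms are never visited.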

-- 
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org