[BioC] disagreement between biomaRt and topGO

Steffen sdurinck at lbl.gov
Wed Sep 5 20:22:37 CEST 2007


Hi Dick,

I can't help you with the topGO part of your question but a faster way 
to get the EntrezGene identifiers associated with a certain GO id is as 
follows:

ids = getBM(c("entrezgene","go"), filters="go", values="GO:0006836", 
mart = mart)

It looks like there are indeed 32 EntrezGene identifiers associated with 
this term in the current release of Ensembl.

Cheers,
Steffen

Dick Beyer wrote:
> Hello,
>
> I have a puzzling problem with a disagreement between the numbers of EntrezGene IDs I get from biomaRt and what I get from topGO for the GO category GO:0006836, neurotransmitter transport. I would assume the mismatch would be due to different GO builds, but what I find seems too extreme for that.
>
> Let me give a little background to how I get my list of Entrez Gene IDs.
>
> I start with the file ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz
>
> from which I extract IPI, EG, Symbol, description, UniGene
>
> This eventually gives me a list of 22963 unique Entrez Gene IDs.  I process this list through biomaRt as follows:
>
> library(biomaRt)
> mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl")
> get.go.biomart <- 
> getGO(id=ipi.LL.sym.descrip.ug.unique_eg[,2],type="entrezgene",mart=mart)
>
> >From this data.frame I can determine that there are 32 unique Entrez Gene IDs in the Gene Ontology Biological Process category GO:0006836, neurotransmitter transport. Everything is fine so far.
>
> Now I set up things for a topGO (non-affy) calculation:
>
> gene2GO <- as.list(get.go.biomart$go)
> names(gene2GO) <- get.go.biomart$entrezgene
>
> geneList <- list()
> geneList[1] <- list(factor(as.integer(ipi.LL.sym.descrip.ug.unique_eg[,2] %in% LL1027[,2])))
> names(geneList[[1]]) <- ipi.LL.sym.descrip.ug.unique_eg[,2]
>
> This geneList[[1]] has, among its 22963 entries, 32 entries (that is, 32 of the "names" are Entrez Gene IDs) that are in category GO:0006836, and 4 of entries I have marked as significant by setting their values to 1.
>
> Next, I create a topGOdata Biological Process object:
>
> BP.GOdata <- list() 
> BP.GOdata[1i] <- list(new("topGOdata", ontology = "BP", allGenes = geneList[[1]],
> annot = annFUN.gene2GO, gene2GO = gene2GO))
>
> and submit this topGOdata object to the following:
>
>      test.stat.BP    <- list()
>      result.BP       <- list()
>      test.stat.BP[1] <- list(new("classicCount", testStatistic = GOFisherTest, name = "Fisher test"))
>      result.BP[1]    <- list(getSigGroups(BP.GOdata, test.stat.BP[[1]]))
>      test.stat.BP[2] <- list(new("elimCount", testStatistic = GOFisherTest, name = "Fisher test", cutOff = 0.01))
>      result.BP[2]    <- list(getSigGroups(BP.GOdata, test.stat.BP[[2]]))
>      test.stat.BP[3] <- list(new("weightCount", testStatistic = GOFisherTest, name = "Fisher test", sigRatio = "ratio"))
>      result.BP[3]    <- list(getSigGroups(BP.GOdata, test.stat.BP[[3]]))
>      l <- list(classic = score(result.BP[[1]]), elim = score(result.BP[[2]]), weight = score(result.BP[[3]]))
>      allRes.BP <- genTable(BP.GOdata, l, orderBy = "classic", ranksOf = "classic", top = length(score(result.BP[[1]])))
>
> My resulting data.frame, allRes.BP, shows that for category GO:0006836, there are only 11 annotated Entrez Gene IDs, not 32. The call to genTable() does correctly find that 4 of these are significant.
>
> This latest released version of topGO must be using the Mar 14 11:46:06 2007 build of GO, and I assume biomaRt is more current as I did the query just today.  If category GO:0006836, neurotransmitter transport, has grown from 11 EGs to 32 EGs between now and last March for Human, then everything is as it should be.  But this seems unlikely to me.  So I wonder if there is a bug in the topGO calculations.
>
> Here is my session:
>
>
> sessionInfo()
> R version 2.5.1 (2007-06-27) 
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] "tools"     "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"
>
> other attached packages:
>      goTools hgu133plus2   Rgraphviz geneplotter     lattice       topGO     SparseM    annotate          GO       graph     Biobase     biomaRt       RCurl         XML
>      "1.8.0"    "1.16.0"    "1.14.1"    "1.14.0"   "0.15-11"     "1.2.1"      "0.73"    "1.14.1"    "1.16.0"    "1.14.2"    "1.14.1"    "1.10.1"     "0.8-0"     "1.9-0"
>
> Thanks very much,
> Dick
> *******************************************************************************
> Richard P. Beyer, Ph.D.	University of Washington
> Tel.:(206) 616 7378	Env. & Occ. Health Sci. , Box 354695
> Fax: (206) 685 4696	4225 Roosevelt Way NE, # 100
>  			Seattle, WA 98105-6099
> http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
> http://staff.washington.edu/~dbeyer
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>



More information about the Bioconductor mailing list