[BioC] Hypergeometric test with Disease Ontology

Vincent Carey stvjc at channing.harvard.edu
Fri Jan 28 21:25:26 CET 2011


This has been an interesting discussion.  A marginally connected
question: Are there tools to normalize found lists of symptoms that
use informal terminology so that they can be matched to formally
catalogued terms in DO?

On Fri, Jan 28, 2011 at 9:57 AM, Ted Morrow <ted.morrow at ebc.uu.se> wrote:
> Hi Gilbert,
>
> Great thanks for the clarification....OK building myGA works fine and I have
> setAnnLib to org.Hs.eg.db with:
> myGA2 <- setAnnLib(myGA, 'org.Hs.eg.db')
> getAnnLib(myGA2)
> [1] "org.Hs.eg.db"
>
> .....but at the geneAnswerReadable step:
> myGAreadable <- geneAnswersReadable(myGA2, catTerm=FALSE)
>
> I get the following error:
> [1] "Mapping geneInput ..."
> [1] "Warning: some genes do not have valid symbols!"
> [1] "Mapping genesInCategory ..."
>
>  traceback()
> 2: stop(paste("The input geneAnswers categoryType is not DOLite but ",
>       x at categoryType, ". stop function!"))
> 1: topDOLiteGenes(myGA, orderby = "pvalue", topGenes = "ALL", genesOrderBy =
> "pValue",
>       file = TRUE)
>
> Cheers
> Ted
>
> sessionInfo()
> R version 2.12.1 (2010-12-16)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
> States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>                 LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>  [1] org.Hs.eg.db_2.4.6   GeneAnswers_1.6.0    RColorBrewer_1.0-2
> Heatplus_1.20.0      rgl_0.92.798         MASS_7.3-9           RSQLite_0.9-4
>        DBI_0.2-5            XML_3.2-0.2          annotate_1.28.0
>  AnnotationDbi_1.12.0
> [12] Biobase_2.10.0       RCurl_1.5-0.1        bitops_1.0-4.1
> igraph_0.5.5-1
>
> loaded via a namespace (and not attached):
> [1] graph_1.28.0     grid_2.12.1      RBGL_1.26.0      Rgraphviz_1.27.0
> tools_2.12.1     xtable_1.5-6
>>
>
>
>
> On 1/28/2011 3:40 PM, Gilbert Feng wrote:
>>
>> Hi, Ted
>>
>> Yes, your codes about mygenepool and mygenes are correct. I didn't clearly
>> state them in my previous email. Sorry about that. myDOLite should be built
>> by DOLite and mygenepool, not mygenes. So currently, myDOLite is a
>> customized annotation library and when you build a geneAnswers instance,
>> leave categoryType to NULL(default value), and in your case, I think the
>> total gene number should be 1310, not 1202, since the former is the gene
>> number of mygenepool.
>>
>> after buiding myGA, run setAnnLib to set it to org.Hs.eg.db I guess the
>> error is that you don't set it.
>>
>> For the further mapping function, like geneAnswerReadable or other
>> visualization, always set catTerm=FALSE.
>>
>> Let me know if it doesn't work.
>>
>> Best
>>
>> Gilbert
>>
>>
>>> I have two gene pools - one is a universe of genes (called say
>>> "mygenepool") which is only a subset on all human genes, the other is a
>>> subset of mygenepool (called say "mygenes"). I am interested in knowing
>>> which DO terms show over-representation in mygenes given the DO terms
>>> present in mygenepool. So, in your example I am a bit confused as to
>>> which pool "mygenes" refers to. Is it "mygenepool" or "mygenes"?
>>>
>>> So I have......
>>> str(mygenepool) # I used as.matrix then as.vector to change the
>>> dataframe into a vector...is that OK?
>>> int [1:1310] 8813 3075 2729 8379 204 170302 10165 6521 799 3052 ...
>>>
>>> and
>>> str(mygenes)
>>> int [1:1202] 6833 10083 60528 8842 5048 4843 6329 5080 6401 6853 ...
>>>
>>>
>>> and mygenes is a subset of mygenepool:
>>> overlap <- intersect(mygenepool, mygenes)
>>> str(overlap)
>>> int [1:1202] 6833 10083 60528 8842 5048 4843 6329 5080 6401 6853 ...
>>>
>>>
>>> Running the first bit of your example code seems to work fine:
>>> myDOLite <- lapply(DOLite, intersect, mygenepool)
>>> names(myDOLite) <- DOLiteTerm[names(myDOLite)]
>>>
>>> and so does creating a GeneAnswers instance (myGA) whether I specify the
>>> totalGeneNumber or not:
>>> myGA <- geneAnswersBuilder(mygenes, myDOLite, categoryType='DOLITE',
>>> testType='hyperG',
>>> #known=FALSE, totalGeneNumber=1202,
>>> pvalueT=0.1, geneExpressionProfile=NULL, verbose=TRUE)
>>>
>>> But then I get the following error:
>>> myGAreadable <- geneAnswersReadable(myGA, verbose=TRUE)
>>> [1] "Mapping geneInput ..."
>>> Error in switch(sub("org.*[:.:]", "", libname), eg = "EG", tair = "TAIR",
>>> :
>>> EXPR must be a length 1 vector
>>>
>>> traceback()
>>> 2: getSymbols(as.vector(x at geneInput[, 1]), x at annLib, strict = strict,
>>> missing = missing)
>>> 1: geneAnswersReadable(myGA, verbose = TRUE)
>>>
>>> If I instead create the GeneAnswers instance with 'org.Hs.eg.db' and not
>>> myDOLite the step.....
>>> myGAreadable <- geneAnswersReadable(myGA, verbose=TRUE)
>>>
>>> .....works fine as do later steps (for example geneAnsersChartPlots and
>>> topDOLiteGenes)
>>>
>>> But then presumably I am not working with the restricted pool of genes
>>> that I want to (i.e. mygenepool)
>>>
>>> Any ideas?
>>> Cheers
>>> Ted
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 1/27/2011 5:19 PM, Gilbert Feng wrote:
>>>>
>>>> Hi, Ted
>>>>
>>>> If I correctly understand your question, you have your own gene pool,
>>>> don't you? In that case, GeneAnswers still can handle that. What you
>>>> need is to build a customized DOLite list to run analysis. For example,
>>>>
>>>> if your genes are mygenes(an unique Entrez GENE ID vector)
>>>>
>>>> myDOLite <- lapply(DOLite, intersect, mygenes)
>>>> names(myDOLite) <- DOLiteTerm[names(myDOLite)]
>>>> #Please note that the parameter 'known' is TRUE in geneAnswersBuilder,
>>>> which means the test will only use shared genes in mygenes and DOLite
>>>> genes as the gene pool to run hypergeometric test. You can change the
>>>> number of gene pool by set parameter 'totalGeneNumber' and set 'known'
>>>> to FALSE at the same time .
>>>> myGA <- geneAnswersBuilder(mygenes, myDOLite, testType='hyperG', ...)
>>>>
>>>> since DO is designed for human, you can use setAnnLib to set the
>>>> species as 'org.Hs.eg.db' for further visualization or analysis.
>>>>
>>>> Let me know if you have any question.
>>>>
>>>> Gilbert
>>>>
>>>> BTW, the current one has a new function to automatically generate a
>>>> report for a or several groups of genes. Please check "groupReport"
>>>> page, but before you run it, make sure you have java support and flash
>>>> installation.
>>>>
>>>>
>>>> On 1/27/11 9:45 AM, Ted Morrow wrote:
>>>>>
>>>>> Thanks for your replies Gilbert and Peter.....I forgot to say in my
>>>>> original question that I did look into using both the GeneAnswers
>>>>> package and the FunDO web-based system but the issue I have not been
>>>>> able to solve with them is that I cannot specify "my gene Universe"
>>>>> within which my selected genes are tested for enrichment. I want to see
>>>>> which diseases are enriched out of the possible diseases in "my gene
>>>>> universe" which is a subset of all possible human genes.
>>>>>
>>>>> Is that possible in GeneAnswers (or even FunDO)?
>>>>>
>>>>> Cheers
>>>>> Ted
>>>>>
>>>>> On 1/27/2011 4:39 PM, Peter Robinson wrote:
>>>>>>
>>>>>> On 01/27/2011 04:17 PM, Gilbert Feng wrote:
>>>>>>>
>>>>>>> Hi, Ted
>>>>>>>
>>>>>>> DO.db is a standard sqlite file and you can use standard RSQLite
>>>>>>> procedure to retrieve the information. Actually we do have a lite
>>>>>>> version of Disease Ontology, DOLite that removes some redundant
>>>>>>> nodes,
>>>>>>> integrated in GeneAnswers package, which also uses hypergeometric
>>>>>>> test
>>>>>>> to run enrichment analysis as well as automatically generated
>>>>>>> interactive(cytoscape web support) html summary for one or more
>>>>>>> groups
>>>>>>> of genes.
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> Gilbert
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> You might also want to take a look at this website:
>>>>>> http://django.nubic.northwestern.edu/fundo/faq
>>>>>> which implements enrichment analysis using genes annotated to DO
>>>>>> terms.
>>>>>> -peter
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On 1/27/11 5:00 AM, Ted Morrow wrote:
>>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> I would like to conduct a hypergeometric test on a list of genes but
>>>>>>>> using Disease Ontology instead of GO or KEGG terms. The package
>>>>>>>> "DO.db"
>>>>>>>> contains this information but I have been unable to find a way of
>>>>>>>> using
>>>>>>>> this database in conjunction with the "GOstats" package that I have
>>>>>>>> been
>>>>>>>> using.
>>>>>>>>
>>>>>>>> Has anyone attempted to run a hypergeometric test for Disease
>>>>>>>> Ontology
>>>>>>>> terms? Is there another package I could use? Or is there a way of
>>>>>>>> modifying the argument (if that's the right word) GOHyperGParams in
>>>>>>>> GOstats so that it can make use of the information in DO.db?
>>>>>>>>
>>>>>>>> Both the GO and KEGG analyses work fine:
>>>>>>>> ### GO chunk
>>>>>>>> params<- new("GOHyperGParams",
>>>>>>>> geneIds = selectedEntrezIds, universeGeneIds = entrezUniverse,
>>>>>>>> annotation = "hgu95av2.db", ontology="BP",
>>>>>>>> pvalueCutoff=0.01, conditional=FALSE,
>>>>>>>> testDirection="over")
>>>>>>>> hgOver<- hyperGTest(params)
>>>>>>>>
>>>>>>>> hgOver
>>>>>>>> Gene to GO BP test for over-representation
>>>>>>>> 2136 GO BP ids tested (15 have p< 0.01)
>>>>>>>> Selected gene set size: 112
>>>>>>>> Gene universe size: 951
>>>>>>>> Annotation package: hgu95av2
>>>>>>>>
>>>>>>>> ### KEGG chunk
>>>>>>>> paramsKEGG<- new("KEGGHyperGParams",
>>>>>>>> geneIds = selectedEntrezIds, universeGeneIds = entrezUniverse,
>>>>>>>> annotation = "hgu95av2.db",
>>>>>>>> pvalueCutoff=0.01,
>>>>>>>> testDirection="over")
>>>>>>>>
>>>>>>>> hgOverKEGG<- hyperGTest(paramsKEGG)
>>>>>>>> hgOverKEGG
>>>>>>>> summary(hgOverKEGG)
>>>>>>>>
>>>>>>>> My data looks like this:
>>>>>>>> str(selectedEntrezIds)
>>>>>>>> chr [1:157] "60528" "6853" "10157" "5081" "389434" "6591" "7414"
>>>>>>>> "7546"
>>>>>>>> "3074" "6916" "6559" "23503" "8626" "6557" "38" "60" "9733" "113235"
>>>>>>>> "28962" "10269" "4069" "30835" "7018" ...
>>>>>>>>
>>>>>>>> > str(entrezUniverse)
>>>>>>>> chr [1:1310] "8813" "3075" "2729" "8379" "204" "170302" "10165"
>>>>>>>> "6521"
>>>>>>>> "799" "3052" "1387" "5244" "3674" "6833" "10083" "60528" "8842"
>>>>>>>> "5048"
>>>>>>>> "4843" "6329" "5080" "6401" "6853" ...
>>>>>>>>
>>>>>>>> My naive attempts to use DO have included:
>>>>>>>> paramsDO<- new("DOHyperGParams",
>>>>>>>> geneIds = selectedEntrezIds, universeGeneIds = entrezUniverse,
>>>>>>>> annotation = "DO.db",
>>>>>>>> pvalueCutoff=0.01,
>>>>>>>> testDirection="over")
>>>>>>>>
>>>>>>>> Which of course doesn't work and gives the following error:
>>>>>>>> Error in getClass(Class, where = topenv(parent.frame())) :
>>>>>>>> "DOHyperGParams" is not a defined class
>>>>>>>>
>>>>>>>> > traceback()
>>>>>>>> 3: stop(gettextf("\"%s\" is not a defined class", Class), domain =
>>>>>>>> NA)
>>>>>>>> 2: getClass(Class, where = topenv(parent.frame()))
>>>>>>>> 1: new("DOHyperGParams", geneIds = selectedEntrezIds,
>>>>>>>> universeGeneIds =
>>>>>>>> entrezUniverse,
>>>>>>>> annotation = "DO.db", pvalueCutoff = 0.01, testDirection = "over")
>>>>>>>>
>>>>>>>>
>>>>>>>> Replacing "GOHyperGParams" with "DOHyperGParams" also gives the
>>>>>>>> following error:..
>>>>>>>>
>>>>>>>> hgOverDO<- hyperGTest(paramsDO)
>>>>>>>> Error in match.arg(ontology, c("BP", "CC", "MF")) :
>>>>>>>> 'arg' should be one of “BP”, “CC”, “MF”
>>>>>>>>
>>>>>>>> traceback()
>>>>>>>> 10: stop(gettextf("'arg' should be one of %s",
>>>>>>>> paste(dQuote(choices),
>>>>>>>> collapse = ", ")), domain = NA)
>>>>>>>> 9: match.arg(ontology, c("BP", "CC", "MF"))
>>>>>>>> 8: getUniverseViaGo(p)
>>>>>>>> 7: universeBuilder(p)
>>>>>>>> 6: universeBuilder(p)
>>>>>>>> 5: .hyperGTestInternal(p)
>>>>>>>> 4: is(object, Cl)
>>>>>>>> 3: is(object, Cl)
>>>>>>>> 2: .valueClassTest(standardGeneric("hyperGTest"),
>>>>>>>> "HyperGResultBase",
>>>>>>>> "hyperGTest")
>>>>>>>> 1: hyperGTest(paramsDO)
>>>>>>>>
>>>>>>>>
>>>>>>>> Any help would be greatly appreciated.
>>>>>>>> /Ted
>>>>>>>>
>>>>>>>> > sessionInfo()
>>>>>>>> R version 2.12.1 (2010-12-16)
>>>>>>>> Platform: i386-pc-mingw32/i386 (32-bit)
>>>>>>>>
>>>>>>>> locale:
>>>>>>>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
>>>>>>>> States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>>>>>>>> LC_TIME=English_United States.1252
>>>>>>>>
>>>>>>>> attached base packages:
>>>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>>>
>>>>>>>> other attached packages:
>>>>>>>> [1] KEGG.db_2.4.5 DO.db_2.1.0 GO.db_2.4.5 hgu95av2.db_2.4.5
>>>>>>>> org.Hs.eg.db_2.4.6 GOstats_2.16.0 RSQLite_0.9-4 DBI_0.2-5
>>>>>>>> graph_1.28.0
>>>>>>>> Category_2.16.0 AnnotationDbi_1.12.0
>>>>>>>> [12] Biobase_2.10.0
>>>>>>>>
>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>> [1] annotate_1.28.0 genefilter_1.32.0 GSEABase_1.12.2 RBGL_1.26.0
>>>>>>>> splines_2.12.1 survival_2.36-2 tools_2.12.1 XML_3.2-0.2 xtable_1.5-6
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>> .
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
> --
> __________________________________________________________________________________
>
> Edward H. Morrow
> Department of Ecology and Evolution
> Animal Ecology (zooekologi)
> Evolutionary Biology Centre
> Uppsala University
> Norbyvägen 18-D
> SE-752 36 Uppsala
> SWEDEN
>
> Email: ted.morrow at ebc.uu.se
> Tel: +46 18 471 2676
> Fax +46 18 471 6484
> Webpage:
> http://www.iee.uu.se/zooekol/default.php?type=personalpage&id=119&lang=en
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list