[BioC] Hypergeometric test with Disease Ontology

Gilbert Feng g-feng at northwestern.edu
Fri Jan 28 15:40:20 CET 2011


Hi, Ted

Yes, your codes about mygenepool and mygenes are correct. I didn't 
clearly state them in my previous email. Sorry about that. myDOLite 
should be built by DOLite and mygenepool, not mygenes. So currently, 
myDOLite is a customized annotation library and when you build a 
geneAnswers instance, leave categoryType to NULL(default value), and in 
your case, I think the total gene number should be 1310, not 1202, since 
the former is the gene number of mygenepool.

after buiding myGA, run setAnnLib to set it to org.Hs.eg.db I guess the 
error is that you don't set it.

For the further mapping function, like geneAnswerReadable or other 
visualization, always set catTerm=FALSE.

Let me know if it doesn't work.

Best

Gilbert


> I have two gene pools - one is a universe of genes (called say
> "mygenepool") which is only a subset on all human genes, the other is a
> subset of mygenepool (called say "mygenes"). I am interested in knowing
> which DO terms show over-representation in mygenes given the DO terms
> present in mygenepool. So, in your example I am a bit confused as to
> which pool "mygenes" refers to. Is it "mygenepool" or "mygenes"?
>
> So I have......
> str(mygenepool) # I used as.matrix then as.vector to change the
> dataframe into a vector...is that OK?
> int [1:1310] 8813 3075 2729 8379 204 170302 10165 6521 799 3052 ...
>
> and
> str(mygenes)
> int [1:1202] 6833 10083 60528 8842 5048 4843 6329 5080 6401 6853 ...
>
>
> and mygenes is a subset of mygenepool:
> overlap <- intersect(mygenepool, mygenes)
> str(overlap)
> int [1:1202] 6833 10083 60528 8842 5048 4843 6329 5080 6401 6853 ...
>
>
> Running the first bit of your example code seems to work fine:
> myDOLite <- lapply(DOLite, intersect, mygenepool)
> names(myDOLite) <- DOLiteTerm[names(myDOLite)]
>
> and so does creating a GeneAnswers instance (myGA) whether I specify the
> totalGeneNumber or not:
> myGA <- geneAnswersBuilder(mygenes, myDOLite, categoryType='DOLITE',
> testType='hyperG',
> #known=FALSE, totalGeneNumber=1202,
> pvalueT=0.1, geneExpressionProfile=NULL, verbose=TRUE)
>
> But then I get the following error:
> myGAreadable <- geneAnswersReadable(myGA, verbose=TRUE)
> [1] "Mapping geneInput ..."
> Error in switch(sub("org.*[:.:]", "", libname), eg = "EG", tair = "TAIR", :
> EXPR must be a length 1 vector
>
> traceback()
> 2: getSymbols(as.vector(x at geneInput[, 1]), x at annLib, strict = strict,
> missing = missing)
> 1: geneAnswersReadable(myGA, verbose = TRUE)
>
> If I instead create the GeneAnswers instance with 'org.Hs.eg.db' and not
> myDOLite the step.....
> myGAreadable <- geneAnswersReadable(myGA, verbose=TRUE)
>
> .....works fine as do later steps (for example geneAnsersChartPlots and
> topDOLiteGenes)
>
> But then presumably I am not working with the restricted pool of genes
> that I want to (i.e. mygenepool)
>
> Any ideas?
> Cheers
> Ted
>
>
>
>
>
>
> On 1/27/2011 5:19 PM, Gilbert Feng wrote:
>> Hi, Ted
>>
>> If I correctly understand your question, you have your own gene pool,
>> don't you? In that case, GeneAnswers still can handle that. What you
>> need is to build a customized DOLite list to run analysis. For example,
>>
>> if your genes are mygenes(an unique Entrez GENE ID vector)
>>
>> myDOLite <- lapply(DOLite, intersect, mygenes)
>> names(myDOLite) <- DOLiteTerm[names(myDOLite)]
>> #Please note that the parameter 'known' is TRUE in geneAnswersBuilder,
>> which means the test will only use shared genes in mygenes and DOLite
>> genes as the gene pool to run hypergeometric test. You can change the
>> number of gene pool by set parameter 'totalGeneNumber' and set 'known'
>> to FALSE at the same time .
>> myGA <- geneAnswersBuilder(mygenes, myDOLite, testType='hyperG', ...)
>>
>> since DO is designed for human, you can use setAnnLib to set the
>> species as 'org.Hs.eg.db' for further visualization or analysis.
>>
>> Let me know if you have any question.
>>
>> Gilbert
>>
>> BTW, the current one has a new function to automatically generate a
>> report for a or several groups of genes. Please check "groupReport"
>> page, but before you run it, make sure you have java support and flash
>> installation.
>>
>>
>> On 1/27/11 9:45 AM, Ted Morrow wrote:
>>> Thanks for your replies Gilbert and Peter.....I forgot to say in my
>>> original question that I did look into using both the GeneAnswers
>>> package and the FunDO web-based system but the issue I have not been
>>> able to solve with them is that I cannot specify "my gene Universe"
>>> within which my selected genes are tested for enrichment. I want to see
>>> which diseases are enriched out of the possible diseases in "my gene
>>> universe" which is a subset of all possible human genes.
>>>
>>> Is that possible in GeneAnswers (or even FunDO)?
>>>
>>> Cheers
>>> Ted
>>>
>>> On 1/27/2011 4:39 PM, Peter Robinson wrote:
>>>> On 01/27/2011 04:17 PM, Gilbert Feng wrote:
>>>>> Hi, Ted
>>>>>
>>>>> DO.db is a standard sqlite file and you can use standard RSQLite
>>>>> procedure to retrieve the information. Actually we do have a lite
>>>>> version of Disease Ontology, DOLite that removes some redundant nodes,
>>>>> integrated in GeneAnswers package, which also uses hypergeometric test
>>>>> to run enrichment analysis as well as automatically generated
>>>>> interactive(cytoscape web support) html summary for one or more groups
>>>>> of genes.
>>>>>
>>>>> Best
>>>>>
>>>>> Gilbert
>>>>
>>>>
>>>>
>>>>
>>>> You might also want to take a look at this website:
>>>> http://django.nubic.northwestern.edu/fundo/faq
>>>> which implements enrichment analysis using genes annotated to DO terms.
>>>> -peter
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> On 1/27/11 5:00 AM, Ted Morrow wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> I would like to conduct a hypergeometric test on a list of genes but
>>>>>> using Disease Ontology instead of GO or KEGG terms. The package
>>>>>> "DO.db"
>>>>>> contains this information but I have been unable to find a way of
>>>>>> using
>>>>>> this database in conjunction with the "GOstats" package that I have
>>>>>> been
>>>>>> using.
>>>>>>
>>>>>> Has anyone attempted to run a hypergeometric test for Disease
>>>>>> Ontology
>>>>>> terms? Is there another package I could use? Or is there a way of
>>>>>> modifying the argument (if that's the right word) GOHyperGParams in
>>>>>> GOstats so that it can make use of the information in DO.db?
>>>>>>
>>>>>> Both the GO and KEGG analyses work fine:
>>>>>> ### GO chunk
>>>>>> params<- new("GOHyperGParams",
>>>>>> geneIds = selectedEntrezIds, universeGeneIds = entrezUniverse,
>>>>>> annotation = "hgu95av2.db", ontology="BP",
>>>>>> pvalueCutoff=0.01, conditional=FALSE,
>>>>>> testDirection="over")
>>>>>> hgOver<- hyperGTest(params)
>>>>>>
>>>>>> hgOver
>>>>>> Gene to GO BP test for over-representation
>>>>>> 2136 GO BP ids tested (15 have p< 0.01)
>>>>>> Selected gene set size: 112
>>>>>> Gene universe size: 951
>>>>>> Annotation package: hgu95av2
>>>>>>
>>>>>> ### KEGG chunk
>>>>>> paramsKEGG<- new("KEGGHyperGParams",
>>>>>> geneIds = selectedEntrezIds, universeGeneIds = entrezUniverse,
>>>>>> annotation = "hgu95av2.db",
>>>>>> pvalueCutoff=0.01,
>>>>>> testDirection="over")
>>>>>>
>>>>>> hgOverKEGG<- hyperGTest(paramsKEGG)
>>>>>> hgOverKEGG
>>>>>> summary(hgOverKEGG)
>>>>>>
>>>>>> My data looks like this:
>>>>>> str(selectedEntrezIds)
>>>>>> chr [1:157] "60528" "6853" "10157" "5081" "389434" "6591" "7414"
>>>>>> "7546"
>>>>>> "3074" "6916" "6559" "23503" "8626" "6557" "38" "60" "9733" "113235"
>>>>>> "28962" "10269" "4069" "30835" "7018" ...
>>>>>>
>>>>>> > str(entrezUniverse)
>>>>>> chr [1:1310] "8813" "3075" "2729" "8379" "204" "170302" "10165"
>>>>>> "6521"
>>>>>> "799" "3052" "1387" "5244" "3674" "6833" "10083" "60528" "8842"
>>>>>> "5048"
>>>>>> "4843" "6329" "5080" "6401" "6853" ...
>>>>>>
>>>>>> My naive attempts to use DO have included:
>>>>>> paramsDO<- new("DOHyperGParams",
>>>>>> geneIds = selectedEntrezIds, universeGeneIds = entrezUniverse,
>>>>>> annotation = "DO.db",
>>>>>> pvalueCutoff=0.01,
>>>>>> testDirection="over")
>>>>>>
>>>>>> Which of course doesn't work and gives the following error:
>>>>>> Error in getClass(Class, where = topenv(parent.frame())) :
>>>>>> "DOHyperGParams" is not a defined class
>>>>>>
>>>>>> > traceback()
>>>>>> 3: stop(gettextf("\"%s\" is not a defined class", Class), domain =
>>>>>> NA)
>>>>>> 2: getClass(Class, where = topenv(parent.frame()))
>>>>>> 1: new("DOHyperGParams", geneIds = selectedEntrezIds,
>>>>>> universeGeneIds =
>>>>>> entrezUniverse,
>>>>>> annotation = "DO.db", pvalueCutoff = 0.01, testDirection = "over")
>>>>>>
>>>>>>
>>>>>> Replacing "GOHyperGParams" with "DOHyperGParams" also gives the
>>>>>> following error:..
>>>>>>
>>>>>> hgOverDO<- hyperGTest(paramsDO)
>>>>>> Error in match.arg(ontology, c("BP", "CC", "MF")) :
>>>>>> 'arg' should be one of “BP”, “CC”, “MF”
>>>>>>
>>>>>> traceback()
>>>>>> 10: stop(gettextf("'arg' should be one of %s", paste(dQuote(choices),
>>>>>> collapse = ", ")), domain = NA)
>>>>>> 9: match.arg(ontology, c("BP", "CC", "MF"))
>>>>>> 8: getUniverseViaGo(p)
>>>>>> 7: universeBuilder(p)
>>>>>> 6: universeBuilder(p)
>>>>>> 5: .hyperGTestInternal(p)
>>>>>> 4: is(object, Cl)
>>>>>> 3: is(object, Cl)
>>>>>> 2: .valueClassTest(standardGeneric("hyperGTest"), "HyperGResultBase",
>>>>>> "hyperGTest")
>>>>>> 1: hyperGTest(paramsDO)
>>>>>>
>>>>>>
>>>>>> Any help would be greatly appreciated.
>>>>>> /Ted
>>>>>>
>>>>>> > sessionInfo()
>>>>>> R version 2.12.1 (2010-12-16)
>>>>>> Platform: i386-pc-mingw32/i386 (32-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
>>>>>> States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>>>>>> LC_TIME=English_United States.1252
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] KEGG.db_2.4.5 DO.db_2.1.0 GO.db_2.4.5 hgu95av2.db_2.4.5
>>>>>> org.Hs.eg.db_2.4.6 GOstats_2.16.0 RSQLite_0.9-4 DBI_0.2-5
>>>>>> graph_1.28.0
>>>>>> Category_2.16.0 AnnotationDbi_1.12.0
>>>>>> [12] Biobase_2.10.0
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] annotate_1.28.0 genefilter_1.32.0 GSEABase_1.12.2 RBGL_1.26.0
>>>>>> splines_2.12.1 survival_2.36-2 tools_2.12.1 XML_3.2-0.2 xtable_1.5-6
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> .
>>>>>
>>>>
>>>>
>>>
>>
>

-- 
-----------------------------------------------
Gang (Gilbert) Feng, PhD
Biomedical Informatics Center
Robert H. Lurie Comprehensive Cancer Center
Northwestern University
750 N. Lake Shore Drive, 11th Floor(11-175e)
Chicago, IL  60611
Phone:312-503-2358
Email g-feng (at) northwestern.edu



More information about the Bioconductor mailing list