[BioC] Bug in hyperGTest for KEGGHyperGParams?

Thu Jul 7 18:48:49 CEST 2011

Hi Jenny,

Yes and yes.  And we don't seem to actually disagree about GO at all.  
That is what I meant by the "DAG nature of the GO annotations".

   Marc

On 07/05/2011 03:22 PM, Jenny Drnevich wrote:
> Hi James (and Marc),
>
> Shame on me for not searching the archives. Thanks for pointing out 
> your previous thread with Marc. I do want to make sure I'm 
> understanding this properly. Using my demo results:
>
>> > hg.KEGG
>> $list1
>> Gene to KEGG test for over-representation
>> 190 KEGG ids tested (3 have p < 0.01)
>> Selected gene set size: 280
>> Gene universe size: 1629
>> Annotation package: porcine
>>
>> $list2
>> Gene to KEGG test for over-representation
>> 105 KEGG ids tested (1 have p < 0.01)
>> Selected gene set size: 54
>> Gene universe size: 1363
>> Annotation package: porcine
>
> So in list1, there were 280 selected genes that mapped to 190 KEGG 
> ids, and in the universe there were 1629 genes that mapped to these 
> same 190 KEGG ids? Whereas in list2, there were 54 selected genes 
> that mapped to 105 KEGG ids and 1363 genes in the universe that mapped 
> to the same 105 KEGG ids, which is why the "Gene universe size" is 
> different for the two lists?
>
> Finally, I have to disagree with the reason why GO testing usually 
> (always?) gives the same number for the "Gene universe size" - it's 
> because once you have even a single gene map to any GO term, all the 
> parent terms are pulled in as well, including the root term for the 
> lineage - "biological process", "molecular function" or "cellular 
> component". As long as the root is included, the GO universe will 
> _never_ be made smaller, because any gene in the universe that has a 
> BP term will also map to the root "biological process" term.
>
> Thanks,
> Jenny
>
> At 12:05 PM 7/5/2011, James F. Reid wrote:
>> Hi Jenny,
>>
>> I think you'll find the answer in this thread:
>> https://stat.ethz.ch/pipermail/bioconductor/2010-May/033439.html
>>
>> Best,
>> J.
>>
>> On 07/05/2011 06:13 PM, Jenny Drnevich wrote:
>>> Hi all,
>>>
>>> I'm doing both GO and KEGG over-representation testing on several
>>> different lists of genes, using the same background set for each list.
>>> What's got me puzzled is the difference in the "Gene universe size"
>>> reported from the hyperGTest results for each list from the KEGG test,
>>> even though they have the same background set. When I make a
>>> GOHyperGParams object for each list and test them, the results report
>>> the same "Gene universe size" for each list, which I assume to be the
>>> number of genes in the background that have any GO MF terms. However,
>>> for the KEGG test, each list reports a different "Gene universe size",
>>> so I'm unsure how selecting a different list from the same background
>>> can change the mapping of the background to KEGG terms. I haven't been
>>> able to get into the exact code of calling hyperGTest on a
>>> KEGGHyperGParams object, so I don't know what is going on - is it a bug?
>>> Or for KEGG terms, is this supposed to happen? Reproducible example and
>>> sessionInfo() below.
>>>
>>> Thanks,
>>> Jenny
>>>
>>> > library(annaffy)
>>> Loading required package: Biobase
>>>
>>> Welcome to Bioconductor
>>>
>>> Vignettes contain introductory material. To view, type
>>> 'browseVignettes()'. To cite Bioconductor, see
>>> 'citation("Biobase")' and for packages 'citation("pkgname")'.
>>>
>>> Loading required package: GO.db
>>> Loading required package: AnnotationDbi
>>> Loading required package: DBI
>>>
>>> Loading required package: KEGG.db
>>>
>>> > library(porcine.db)
>>> Loading required package: org.Ss.eg.db
>>>
>>>
>>> > library(GOstats)
>>> Loading required package: Category
>>> Loading required package: graph
>>> >
>>> >
>>> > all.ids <- Rkeys(porcineENTREZID)
>>> > length(all.ids)
>>> [1] 30160
>>> >
>>> >
>>> > set.seed(1234)
>>> > list1 <- sample(all.ids,5000)
>>> > list2 <- list1[1:1000]
>>> > list3 <- list1[4501:5000]
>>> >
>>> > par.MF.list <- list(
>>> + list1 = new("GOHyperGParams", geneIds = list1, universeGeneIds = all.ids,
>>> +   ontology = "MF", annotation = "porcine.db", testDirection = "over",
>>> +   pvalueCutoff = 0.01, conditional = FALSE),
>>> + list2 = new("GOHyperGParams", geneIds = list2, universeGeneIds = all.ids,
>>> +   ontology = "MF", annotation = "porcine.db", testDirection = "over",
>>> +   pvalueCutoff = 0.01, conditional = FALSE),
>>> + list3 = new("GOHyperGParams", geneIds = list3, universeGeneIds = all.ids,
>>> +   ontology = "MF", annotation = "porcine.db", testDirection = "over",
>>> +   pvalueCutoff = 0.01, conditional = FALSE))
>>> >
>>> > hg.MF.list <- lapply(par.MF.list,hyperGTest)
>>> > hg.MF.list
>>> $list1
>>> Gene to GO MF test for over-representation
>>> 1007 GO MF ids tested (1 have p < 0.01)
>>> Selected gene set size: 569
>>> Gene universe size: 3198
>>> Annotation package: porcine
>>>
>>> $list2
>>> Gene to GO MF test for over-representation
>>> 419 GO MF ids tested (6 have p < 0.01)
>>> Selected gene set size: 106
>>> Gene universe size: 3198
>>> Annotation package: porcine
>>>
>>> $list3
>>> Gene to GO MF test for over-representation
>>> 266 GO MF ids tested (2 have p < 0.01)
>>> Selected gene set size: 63
>>> Gene universe size: 3198
>>> Annotation package: porcine
>>>
>>> #Note the Gene universe size is 3198 for all 3 lists
>>>
>>> >
>>> >
>>> > par.KEGG <- list(
>>> + list1 = new("KEGGHyperGParams", geneIds = list1, universeGeneIds = all.ids,
>>> +   annotation = "porcine.db", testDirection = "over", pvalueCutoff = 0.01),
>>> + list2 = new("KEGGHyperGParams", geneIds = list2, universeGeneIds = all.ids,
>>> +   annotation = "porcine.db", testDirection = "over", pvalueCutoff = 0.01),
>>> + list3 = new("KEGGHyperGParams", geneIds = list3, universeGeneIds = all.ids,
>>> +   annotation = "porcine.db", testDirection = "over", pvalueCutoff = 0.01))
>>> >
>>> > hg.KEGG <- lapply(par.KEGG,hyperGTest)
>>> > hg.KEGG
>>> $list1
>>> Gene to KEGG test for over-representation
>>> 190 KEGG ids tested (3 have p < 0.01)
>>> Selected gene set size: 280
>>> Gene universe size: 1629
>>> Annotation package: porcine
>>>
>>> $list2
>>> Gene to KEGG test for over-representation
>>> 105 KEGG ids tested (1 have p < 0.01)
>>> Selected gene set size: 54
>>> Gene universe size: 1363
>>> Annotation package: porcine
>>>
>>> $list3
>>> Gene to KEGG test for over-representation
>>> 87 KEGG ids tested (1 have p < 0.01)
>>> Selected gene set size: 30
>>> Gene universe size: 1204
>>> Annotation package: porcine
>>>
>>> # Now there are 3 different Gene universe sizes: 1629, 1363 and 1204. WHY?
>>>
>>> >
>>> >
>>> > sessionInfo()
>>> R version 2.13.0 (2011-04-13)
>>> Platform: x86_64-pc-mingw32/x64 (64-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
>>> States.1252 LC_MONETARY=English_United States.1252
>>> [4] LC_NUMERIC=C LC_TIME=English_United States.1252
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] GOstats_2.18.0 graph_1.30.0 Category_2.18.0 porcine.db_2.4.7
>>> org.Ss.eg.db_2.5.0 annaffy_1.24.0
>>> [7] KEGG.db_2.5.0 GO.db_2.5.0 RSQLite_0.9-4 DBI_0.2-5
>>> AnnotationDbi_1.14.1 Biobase_2.12.1
>>>
>>> loaded via a namespace (and not attached):
>>> [1] annotate_1.30.0 genefilter_1.34.0 GSEABase_1.14.0 RBGL_1.28.0
>>> splines_2.13.0 survival_2.36-5 tools_2.13.0
>>> [8] XML_3.4-0.2 xtable_1.5-6
>>>
>
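
The behaviour discussed in this thread can be illustrated with a small base-R sketch. It is not the Category/GOstats code: the gene and pathway names are made-up toy data, and reduced_universe_size() is a hypothetical helper that only mimics the reduction as Jenny describes it (keep the categories hit by at least one selected gene, then count the universe genes annotated to those categories). With a flat, KEGG-like map the universe shrinks as the selected list shrinks; adding a GO-like root category that contains every annotated gene keeps it constant.

```r
# Toy gene-to-category map (hypothetical data, not taken from porcine.db).
# Each category lists the universe genes annotated to it.
flat <- list(
  path1 = c("g1", "g2", "g3"),
  path2 = c("g3", "g4"),
  path3 = c("g5", "g6"),
  path4 = c("g7", "g8", "g9")
)

# Hypothetical helper mimicking the reduction described in the thread:
# keep only categories containing at least one selected gene, then count
# the distinct genes annotated to any of the remaining categories.
reduced_universe_size <- function(cat2gene, selected) {
  tested <- Filter(function(genes) any(genes %in% selected), cat2gene)
  length(unique(unlist(tested, use.names = FALSE)))
}

# GO-like map: the same categories plus a root holding every annotated gene,
# standing in for "biological process" etc. at the top of the DAG.
with_root <- c(flat, list(root = unique(unlist(flat, use.names = FALSE))))

# Three nested selections from the same background.
gene_lists <- list(
  list1 = c("g1", "g4", "g5", "g7"),
  list2 = c("g1", "g4"),
  list3 = c("g5")
)

kegg_like <- sapply(gene_lists, reduced_universe_size, cat2gene = flat)
go_like   <- sapply(gene_lists, reduced_universe_size, cat2gene = with_root)

print(kegg_like)
print(go_like)
```

On this toy data the flat map gives universe sizes of 9, 4 and 2 for the three lists, while the rooted map gives 9 for all three, which mirrors the pattern in the hyperGTest output above.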