[BioC] Bug in hyperGTest for KEGGHyperGParams?

Jenny Drnevich drnevich at illinois.edu
Wed Jul 6 00:22:52 CEST 2011


Hi James (and Marc),

Shame on me for not searching the archives. Thanks for pointing out 
your previous thread with Marc. I do want to make sure I'm 
understanding this properly. Using my demo results:

>  > hg.KEGG
>$list1
>Gene to KEGG test for over-representation
>190 KEGG ids tested (3 have p < 0.01)
>Selected gene set size: 280
>Gene universe size: 1629
>Annotation package: porcine
>
>$list2
>Gene to KEGG test for over-representation
>105 KEGG ids tested (1 have p < 0.01)
>Selected gene set size: 54
>Gene universe size: 1363
>Annotation package: porcine

So in list1, there were 280 selected genes that mapped to 190 KEGG 
ids, and in the universe there were 1629 genes that mapped to these 
same 190 KEGG ids? Whereas in list2, there were 54 selected genes 
that mapped to 105 KEGG ids and 1363 genes in the universe that 
mapped to the same 105 KEGG ids, which is why the "Gene universe 
size" is different for the two lists?
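
To make that reading concrete, here is a minimal base-R sketch with toy data. It is not the Category/GOstats code: the gene and pathway IDs are made up, and effective_universe() is a hypothetical helper that only mimics the behaviour I describe above (the universe is cut down to background genes annotated to at least one pathway hit by a selected gene).

```r
# Toy gene-to-pathway annotation (hypothetical IDs, not real KEGG data).
gene2path <- list(
  g1 = c("p1", "p2"),
  g2 = "p2",
  g3 = "p3",
  g4 = "p3",
  g5 = "p4",
  g6 = character(0)   # gene with no pathway annotation
)

# Hypothetical helper: background genes annotated to at least one
# pathway that some selected gene also maps to.
effective_universe <- function(selected, background, gene2path) {
  tested <- unique(unlist(gene2path[intersect(selected, background)],
                          use.names = FALSE))
  keep <- vapply(gene2path[background],
                 function(p) any(p %in% tested), logical(1))
  background[keep]
}

background <- names(gene2path)   # same background for both lists
u1 <- effective_universe(c("g1", "g3"), background, gene2path)
u2 <- effective_universe("g2", background, gene2path)

length(u1)   # 4: g1..g4 share p1, p2 or p3 with the selection
length(u2)   # 2: only g1 and g2 map to p2
```

Under this assumption the same background gives a different universe size for each selected list, which is the pattern in the KEGG output.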

Finally, I have to disagree with the reason why GO testing usually 
(always?) gives the same number for the "Gene universe size" - it's 
because once you have even a single gene map to any GO term, all the 
parent terms are pulled in as well, including the root term for the 
lineage - "biological process", "molecular function" or "cellular 
component". Because the root term is always included, the GO universe 
will _never_ be made smaller: any gene in the universe that has a BP 
term will also map to the root "biological process" term.
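
The root-term argument can be sketched the same way in base R. Again this is toy data, not GO.db: the term names are invented, each term has a single parent (real GO is a DAG with multiple parents), and ancestors() and universe_for() are hypothetical helpers written only for this illustration.

```r
# Toy GO-like hierarchy, child -> parent (hypothetical term names).
parent <- c(t1 = "root", t2 = "root", t3 = "t1")

# All ancestors of a term, walking up until the root is reached.
ancestors <- function(term) {
  out <- character(0)
  while (term %in% names(parent)) {
    term <- parent[[term]]
    out <- c(out, term)
  }
  out
}

# Direct annotations, then expanded with every ancestor term.
direct <- list(g1 = "t3", g2 = "t2", g3 = "t1", g4 = "t2")
expanded <- lapply(direct, function(ts) {
  unique(c(ts, unlist(lapply(ts, ancestors))))
})

# Keep every gene that shares at least one term with the selection.
universe_for <- function(selected) {
  tested <- unique(unlist(expanded[selected], use.names = FALSE))
  keep <- vapply(expanded, function(ts) any(ts %in% tested), logical(1))
  names(expanded)[keep]
}

length(universe_for("g1"))            # 4
length(universe_for(c("g2", "g4")))   # 4
```

Every annotated gene carries "root" after expansion, so any non-empty selection pulls in the whole annotated background and the universe size stays constant.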

Thanks,
Jenny





At 12:05 PM 7/5/2011, James F. Reid wrote:
>Hi Jenny,
>
>I think you'll find the answer in this thread:
>https://stat.ethz.ch/pipermail/bioconductor/2010-May/033439.html
>
>Best,
>J.
>
>On 07/05/2011 06:13 PM, Jenny Drnevich wrote:
>>Hi all,
>>
>>I'm doing both GO and KEGG over-representation testing on several
>>different lists of genes, using the same background set for each list.
>>What's got me puzzled is the difference in the "Gene universe size"
>>reported from the hyperGTest results for each list from the KEGG test,
>>even though they have the same background set. When I make a
>>GOHyperGParams object for each list and test them, the results report
>>the same "Gene universe size" for each list, which I assume to be the
>>number of genes in the background that have any GO MF terms. However,
>>for the KEGG test, each list reports a different "Gene universe size",
>>so I'm unsure how selecting a different list from the same background
>>can change the mapping of the background to KEGG terms. I haven't been
>>able to get into the exact code of calling hyperGTest on a
>>KEGGHyperGParams object, so I don't know what is going on - is it a bug?
>>Or for KEGG terms, is this supposed to happen? Reproducible example and
>>sessionInfo() below.
>>
>>Thanks,
>>Jenny
>>
>>  > library(annaffy)
>>Loading required package: Biobase
>>
>>Welcome to Bioconductor
>>
>>Vignettes contain introductory material. To view, type
>>'browseVignettes()'. To cite Bioconductor, see
>>'citation("Biobase")' and for packages 'citation("pkgname")'.
>>
>>Loading required package: GO.db
>>Loading required package: AnnotationDbi
>>Loading required package: DBI
>>
>>Loading required package: KEGG.db
>>
>>  > library(porcine.db)
>>Loading required package: org.Ss.eg.db
>>
>>
>>  > library(GOstats)
>>Loading required package: Category
>>Loading required package: graph
>>  >
>>  >
>>  > all.ids <- Rkeys(porcineENTREZID)
>>  > length(all.ids)
>>[1] 30160
>>  >
>>  >
>>  > set.seed(1234)
>>  > list1 <- sample(all.ids,5000)
>>  > list2 <- list1[1:1000]
>>  > list3 <- list1[4501:5000]
>>  >
>>  > par.MF.list <- list(list1 = new("GOHyperGParams", geneIds = list1,
>>universeGeneIds = all.ids,ontology="MF",
>>+ annotation="porcine.db", testDirection="over",
>>pvalueCutoff=0.01,conditional=F),
>>+ list2 = new("GOHyperGParams", geneIds = list2, universeGeneIds =
>>all.ids,ontology="MF",
>>+ annotation="porcine.db", testDirection="over",
>>pvalueCutoff=0.01,conditional=F) ,
>>+ list3 = new("GOHyperGParams", geneIds = list3, universeGeneIds =
>>all.ids,ontology="MF",
>>+ annotation="porcine.db", testDirection="over",
>>pvalueCutoff=0.01,conditional=F))
>>  >
>>  > hg.MF.list <- lapply(par.MF.list,hyperGTest)
>>  > hg.MF.list
>>$list1
>>Gene to GO MF test for over-representation
>>1007 GO MF ids tested (1 have p < 0.01)
>>Selected gene set size: 569
>>Gene universe size: 3198
>>Annotation package: porcine
>>
>>$list2
>>Gene to GO MF test for over-representation
>>419 GO MF ids tested (6 have p < 0.01)
>>Selected gene set size: 106
>>Gene universe size: 3198
>>Annotation package: porcine
>>
>>$list3
>>Gene to GO MF test for over-representation
>>266 GO MF ids tested (2 have p < 0.01)
>>Selected gene set size: 63
>>Gene universe size: 3198
>>Annotation package: porcine
>>
>>#Note the Gene universe size is 3198 for all 3 lists
>>
>>  >
>>  >
>>  > par.KEGG <- list(list1 = new("KEGGHyperGParams", geneIds = list1,
>>universeGeneIds = all.ids,
>>+ annotation="porcine.db", testDirection="over", pvalueCutoff=0.01),
>>+ list2= new("KEGGHyperGParams", geneIds = list2, universeGeneIds =
>>all.ids,
>>+ annotation="porcine.db", testDirection="over", pvalueCutoff=0.01) ,
>>+ list3= new("KEGGHyperGParams", geneIds = list3, universeGeneIds =
>>all.ids,
>>+ annotation="porcine.db", testDirection="over", pvalueCutoff=0.01) )
>>  >
>>  > hg.KEGG <- lapply(par.KEGG,hyperGTest)
>>  > hg.KEGG
>>$list1
>>Gene to KEGG test for over-representation
>>190 KEGG ids tested (3 have p < 0.01)
>>Selected gene set size: 280
>>Gene universe size: 1629
>>Annotation package: porcine
>>
>>$list2
>>Gene to KEGG test for over-representation
>>105 KEGG ids tested (1 have p < 0.01)
>>Selected gene set size: 54
>>Gene universe size: 1363
>>Annotation package: porcine
>>
>>$list3
>>Gene to KEGG test for over-representation
>>87 KEGG ids tested (1 have p < 0.01)
>>Selected gene set size: 30
>>Gene universe size: 1204
>>Annotation package: porcine
>>
>># Now there are 3 different Gene universe sizes: 1629, 1363 and 1204. WHY?
>>
>>  >
>>  >
>>  > sessionInfo()
>>R version 2.13.0 (2011-04-13)
>>Platform: x86_64-pc-mingw32/x64 (64-bit)
>>
>>locale:
>>[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
>>States.1252 LC_MONETARY=English_United States.1252
>>[4] LC_NUMERIC=C LC_TIME=English_United States.1252
>>
>>attached base packages:
>>[1] stats graphics grDevices utils datasets methods base
>>
>>other attached packages:
>>[1] GOstats_2.18.0 graph_1.30.0 Category_2.18.0 porcine.db_2.4.7
>>org.Ss.eg.db_2.5.0 annaffy_1.24.0
>>[7] KEGG.db_2.5.0 GO.db_2.5.0 RSQLite_0.9-4 DBI_0.2-5
>>AnnotationDbi_1.14.1 Biobase_2.12.1
>>
>>loaded via a namespace (and not attached):
>>[1] annotate_1.30.0 genefilter_1.34.0 GSEABase_1.14.0 RBGL_1.28.0
>>splines_2.13.0 survival_2.36-5 tools_2.13.0
>>[8] XML_3.4-0.2 xtable_1.5-6
>>
>>_______________________________________________
>>Bioconductor mailing list
>>Bioconductor at r-project.org
>>https://stat.ethz.ch/mailman/listinfo/bioconductor
>>Search the archives:
>>http://news.gmane.org/gmane.science.biology.informatics.conductor


