[BioC] hypergeometric test in GOstats

Sebastien Gerega seb at gerega.net
Fri Mar 13 02:41:47 CET 2009


Hi Robert,
sorry about not having included the sessionInfo(). I have included all 
the required information this time - my sessionInfo() and links to the 
input data can be found at the end of the email.


Here is the code that I am using:
entrezUni = read.table("entrezUni.tsv", sep="\t")[,1]
sigEntrez = read.table("sigEntrez.tsv", sep="\t")[,1]
params = new("GOHyperGParams", geneIds=as.integer(sigEntrez), 
universeGeneIds = as.integer(entrezUni),
    annotation="mouse4302.db", pvalueCutoff=0.05, testDirection="over", 
ontology = "BP", conditional=FALSE)
hgOver = hyperGTest(params)
report = summary(hgOver)
length(entrezUni)
length(sigEntrez)
report[1:3,]


and the output:
 > length(entrezUni)
[1] 1803
 > length(sigEntrez)
[1] 266
 > report[1:3,]
      GOBPID       Pvalue OddsRatio ExpCount Count Size                  
Term
1 GO:0006952 1.519928e-16  5.660429 18.81197    55  124      defense 
response
2 GO:0050896 9.662799e-16  3.570590 50.36752    99  332  response to 
stimulus
3 GO:0002376 9.810622e-16  4.164349 31.70726    74  209 immune system 
process


So I am not using a conditional test and I have searched through the 
mailing list and documentation for information about how ExpCount is 
calculated but have not found an answer.
I would have expected it to be a simple case of

Size/universeGeneIds*geneIds

However, this is not the case:
 > man = report[1:3,]$Size/length(entrezUni)*length(sigEntrez)
 > report[1:3,]$ExpCount - man
[1] 0.5180113 1.3869335 0.8730997


I understand that I may be omitting some details or steps in the process 
but I have searched and been unable to find any information regarding this.
Your help is very much appreciated.
regards,
Sebastien

http://sydneybioinformatics.org/download/sigEntrez.tsv
http://sydneybioinformatics.org/download/entrezUni.tsv

 > sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252

attached base packages:
[1] splines   tools     stats     graphics  grDevices utils     
datasets  methods   base    

other attached packages:
 [1] GOstats_2.8.0       Category_2.8.2      genefilter_1.22.0   
survival_2.34-1     RBGL_1.18.0         annotate_1.20.1     
xtable_1.5-4      
 [8] GO.db_2.2.5         graph_1.20.0        mouse4302.db_2.2.5  
RSQLite_0.7-1       DBI_0.2-4           AnnotationDbi_1.4.2 
Biobase_2.2.1     

loaded via a namespace (and not attached):
[1] cluster_1.11.12 GSEABase_1.4.0  XML_1.99-0


Robert Gentleman wrote:
> Hi Sebastien,
>   It is expected that you do a little bit of the homework before posting.
> Some things to try:
>
> 1) please read the posting guide, you need to give us information on your
> particular version of R and Bioconductor, so that we can attempt to reproduce
> the results you have.
> 2) you need to give a reproducible example, so we can test and make sure we are
> getting the same answer as you do.  In this case you did not show us even the
> call you used, so we have no way of knowing what was computed, and hence cannot
> do more than speculate - which is a waste of everyone's time.
> 3) you need to read the documentation, there are manual pages and a vignette.
> These are sometimes unclear, and/or incomplete, and letting us know what is not
> clear helps us to improve them.
> 4) you should check the mailing list archive to see if the topic has already
> been discussed (this one has come up very many times), and then you can readily
> obtain your answer.
>
> So, some speculation, I suspect that your test is conditional.
> And if I understand your example, this is easily checked by a direct call to
> fisher.test, which is different from the parts that you did report, again
> suggesting that your testing is conditional.
>
>   
>> xm
>>     
>      [,1] [,2]
> [1,]   55  211
> [2,]   69 1468
>   
>> fisher.test(xm)
>>     
>
>         Fisher's Exact Test for Count Data
>
> data:  xm
> p-value < 2.2e-16
> alternative hypothesis: true odds ratio is not equal to 1
> 95 percent confidence interval:
>  3.701153 8.260393
> sample estimates:
> odds ratio
>   5.537531
>
> best wishes
>   Robert
>
> Sebastien Gerega wrote:
>   
>> Hi,
>> I would like to get a better understanding of exactly how the
>> calculations in GOstats are performed.
>> Does the package use a standard hypergeometric test?
>> If I have the following values:
>> geneIds = 266
>> universeGeneIds = 1803
>>
>> and a particular GO term has:
>> size = 124
>> count = 55
>>
>> then the associated p-value, odds ratio and expected count are
>> 1.519928e-16, 5.660429, and 18.81197 respectively.
>>
>> However, I would have thought the expected count would be 266 * 124 /
>> 1803 = 18.29. In addition, the p-value obtained from the hypergeometric
>> test using the following website http://keisan.casio.com/has10/SpecExec.cgi
>> is different.
>>
>> Are there any steps performed by the GOstats package that make it
>> different from a standard hypergeo test? What is the reason for these
>> differences?
>> thanks in advance,
>> Sebastien
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>     
>
>



More information about the Bioconductor mailing list