[BioC] hypergeometric test in GOstats

Tue Mar 17 23:36:44 CET 2009

Hi Marc,
thanks for your help. I think my confusion came from the fact that I had 
already performed non-specific filtering to only include genes with GO 
annotations. So I assumed that all the genes were being included in the 
test but I now realise the non-specific filtering only ensures that the 
genes have annotation of at least 1 of the GO trees....
thanks again,
Sebastien

Marc Carlson wrote:
> Hi Sebastien,
>
> The source of your confusion on this is the fact that not every gene is
> tagged with a GO term.  Thus, the size of both your gene universe and
> your gene samples are not what you think they are.  They are actually
> both smaller because only those genes which have GO annotation in some
> form can reasonably be included.  After GOstats gets finished paring
> down the gene universe it becomes 1404 long instead of 1803, and then it
> has to only include things in your sample that are actually contained
> within that "pared down" universe which results in your sample being
> only 213 long instead of 266.  That is why the expected count is actually
>
> 213*124/1404 = 18.81197 instead of 18.29.
>
> Actually, you could have even seen that this paring down had occurred by
> looking at the hgOver object that your code produces:
>
>   
>> hgOver
>>     
> Gene to GO BP  test for over-representation
> 1244 GO BP ids tested (131 have p < 0.05)
> Selected gene set size: 213
>     Gene universe size: 1404
>     Annotation package: mouse4302
>
> Hope this clarifies things!
>
>
>   Marc
>
>
>
>
> Sebastien Gerega wrote:
>   
>> Hi Robert,
>> sorry about not having included the sessionInfo(). I have included all
>> the required information this time - my sessionInfo() and links to the
>> input data can be found at the end of the email.
>>
>>
>> Here is the code that I am using:
>> entrezUni = read.table("entrezUni.tsv", sep="\t")[,1]
>> sigEntrez = read.table("sigEntrez.tsv", sep="\t")[,1]
>> params = new("GOHyperGParams", geneIds=as.integer(sigEntrez),
>> universeGeneIds = as.integer(entrezUni),
>>    annotation="mouse4302.db", pvalueCutoff=0.05, testDirection="over",
>> ontology = "BP", conditional=FALSE)
>> hgOver = hyperGTest(params)
>> report = summary(hgOver)
>> length(entrezUni)
>> length(sigEntrez)
>> report[1:3,]
>>
>>
>> and the output:
>>     
>>> length(entrezUni)
>>>       
>> [1] 1803
>>     
>>> length(sigEntrez)
>>>       
>> [1] 266
>>     
>>> report[1:3,]
>>>       
>>      GOBPID       Pvalue OddsRatio ExpCount Count
>> Size                  Term
>> 1 GO:0006952 1.519928e-16  5.660429 18.81197    55  124      defense
>> response
>> 2 GO:0050896 9.662799e-16  3.570590 50.36752    99  332  response to
>> stimulus
>> 3 GO:0002376 9.810622e-16  4.164349 31.70726    74  209 immune system
>> process
>>
>>
>> So I am not using a conditional test and I have searched through the
>> mailing list and documentation for information about how ExpCount is
>> calculated but have not found an answer.
>> I would have expected it to be a simple case of
>>
>> Size/universeGeneIds*geneIds
>>
>> However, this is not the case:
>>     
>>> man = report[1:3,]$Size/length(entrezUni)*length(sigEntrez)
>>> report[1:3,]$ExpCount - man
>>>       
>> [1] 0.5180113 1.3869335 0.8730997
>>
>>
>> I understand that I may be omitting some details or steps in the
>> process but I have searched and been unable to find any information
>> regarding this.
>> Your help is very much appreciated.
>> regards,
>> Sebastien
>>
>> http://sydneybioinformatics.org/download/sigEntrez.tsv
>> http://sydneybioinformatics.org/download/entrezUni.tsv
>>
>>     
>>> sessionInfo()
>>>       
>> R version 2.8.1 (2008-12-22)
>> i386-pc-mingw32
>>
>> locale:
>> LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252
>>
>>
>> attached base packages:
>> [1] splines   tools     stats     graphics  grDevices utils    
>> datasets  methods   base   
>> other attached packages:
>> [1] GOstats_2.8.0       Category_2.8.2      genefilter_1.22.0  
>> survival_2.34-1     RBGL_1.18.0         annotate_1.20.1    
>> xtable_1.5-4      [8] GO.db_2.2.5         graph_1.20.0       
>> mouse4302.db_2.2.5  RSQLite_0.7-1       DBI_0.2-4          
>> AnnotationDbi_1.4.2 Biobase_2.2.1    
>> loaded via a namespace (and not attached):
>> [1] cluster_1.11.12 GSEABase_1.4.0  XML_1.99-0
>>
>>
>> Robert Gentleman wrote:
>>     
>>> Hi Sebastien,
>>>   It is expected that you do a little bit of the homework before
>>> posting.
>>> Some things to try:
>>>
>>> 1) please read the posting guide, you need to give us information on
>>> your
>>> particular version of R and Bioconductor, so that we can attempt to
>>> reproduce
>>> the results you have.
>>> 2) you need to give a reproducible example, so we can test and make
>>> sure we are
>>> getting the same answer as you do.  In this case you did not show us
>>> even the
>>> call you used, so we have no way of knowing what was computed, and
>>> hence cannot
>>> do more than speculate - which is a waste of everyone's time.
>>> 3) you need to read the documentation, there are manual pages and a
>>> vignette.
>>> These are sometimes unclear, and/or incomplete, and letting us know
>>> what is not
>>> clear helps us to improve them.
>>> 4) you should check the mailing list archive to see if the topic has
>>> already
>>> been discussed (this one has come up very many times), and then you
>>> can readily
>>> obtain your answer.
>>>
>>> So, some speculation, I suspect that your test is conditional.
>>> And if I understand your example, this is easily checked by a direct
>>> call to
>>> fisher.test, which is different from the parts that you did report,
>>> again
>>> suggesting that your testing is conditional.
>>>
>>>  
>>>       
>>>> xm
>>>>     
>>>>         
>>>      [,1] [,2]
>>> [1,]   55  211
>>> [2,]   69 1468
>>>  
>>>       
>>>> fisher.test(xm)
>>>>     
>>>>         
>>>         Fisher's Exact Test for Count Data
>>>
>>> data:  xm
>>> p-value < 2.2e-16
>>> alternative hypothesis: true odds ratio is not equal to 1
>>> 95 percent confidence interval:
>>>  3.701153 8.260393
>>> sample estimates:
>>> odds ratio
>>>   5.537531
>>>
>>> best wishes
>>>   Robert
>>>
>>> Sebastien Gerega wrote:
>>>  
>>>       
>>>> Hi,
>>>> I would like to get a better understanding of exactly how the
>>>> calculations in GOstats are performed.
>>>> Does the package use a standard hypergeometric test?
>>>> If I have the following values:
>>>> geneIds = 266
>>>> universeGeneIds = 1803
>>>>
>>>> and a particular GO term has:
>>>> size = 124
>>>> count = 55
>>>>
>>>> then the associated p-value, odds ratio and expected count are
>>>> 1.519928e-16, 5.660429, and 18.81197 respectively.
>>>>
>>>> However, I would have thought the expected count would be 266 * 124 /
>>>> 1803 = 18.29. In addition, the p-value obtained from the hypergeometric
>>>> test using the following website
>>>> http://keisan.casio.com/has10/SpecExec.cgi
>>>> is different.
>>>>
>>>> Are there any steps performed by the GOstats package that make it
>>>> different from a standard hypergeo test? What is the reason for these
>>>> differences?
>>>> thanks in advance,
>>>> Sebastien
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>     
>>>>         
>>>       
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>     
>
>
>
>