[BioC] Hypergeometric Testing questions

Wed Dec 16 15:44:30 CET 2009

Hi Javier,

Here's how I think about it; maybe it will help you. Say your
background (the gene universe) has 10,000 genes, of which 200 (SIZE)
annotate to a particular GO term. If you have 500 genes in your
significant list, you would expect 200/10,000 = X/500, or X = 10
(EXPCOUNT), genes with that GO term if the list were a random sample
of the background. However, in your list of 500 genes, 25 (COUNT) have
that GO term. Therefore, over-representation testing is a sampling
probability problem: if you sample 500 genes out of 10,000, of which
200 are term Y, is getting 25 of them more than you would expect due
to chance alone?
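
To make the arithmetic concrete, here is a minimal sketch of the same example in Python, using only the standard library. The function and variable names are my own, chosen for illustration; this is not the code in the Category package, just the textbook hypergeometric upper tail applied to the numbers above.

```python
from math import comb


def hypergeom_upper_tail(universe, annotated, selected, count):
    """P(X >= count) when `selected` genes are drawn without replacement
    from `universe` genes, `annotated` of which carry the GO term."""
    total = comb(universe, selected)
    max_hits = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(count, max_hits + 1)
    )
    return tail / total


# Numbers from the example above (illustrative, not real data).
universe, size, selected, count = 10_000, 200, 500, 25

# The same example laid out as the 2x2 table from Seth's message.
n11 = count                      # selected and in GO
n12 = selected - count           # selected, not in GO
n21 = size - count               # not selected, in GO
n22 = universe - selected - n21  # not selected, not in GO

# Expected n11: (row total) * (column total) / (grand total).
exp_count = (n11 + n12) * (n11 + n21) / (n11 + n12 + n21 + n22)
p_value = hypergeom_upper_tail(universe, size, selected, count)

print(f"ExpCount = {exp_count}, Count = {count}, p = {p_value:.3g}")
```

A small p-value here says that drawing 25 or more term-Y genes in a random sample of 500 would be unlikely, which is why COUNT is compared against EXPCOUNT. In R, the same tail probability can be obtained with phyper(24, 200, 9800, 500, lower.tail = FALSE).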

HTH,
Jenny

At 06:33 AM 12/16/2009, Javier Pérez Florido wrote:

>>You might find reading the source code in package Category, file
>>R/hyperGTest-methods.R, to be helpful.
>>
>>For a given GO ID, the test proceeds by considering an urn containing
>>the genes in the gene universe.  Genes that are annotated at our GO ID
>>are white balls in the urn and the rest of the genes are black balls
>>in the urn.  We will draw balls from the urn according to the number
>>of genes in the selected gene list.  This leads to a 2x2 table like:
>>
>>            inGO   notGO
>>            white  black
>>selected   n11    n12
>>not        n21    n22
>>
>>The expected value for n11 is:
>>(n11 + n12) * (n11 + n21) / (n11 + n12 + n21 + n22)
>>
>>If you want more details, take a look at the source code in Category.
>>
>>+ seth
>
>Thanks Seth, but looking at the code I'm a 
>little bit confused. Checking the help pages, I 
>would try to explain the meaning of some fields:
>- ExpCount: the expected number of genes in the 
>selected gene list to be found at each tested category
>- Count: how many instances of that term were 
>actually observed in the gene list
>- Size: number that could have been found in the 
>gene list if every instance had turned up.
>
>
>When we are testing for over-representation, Count is greater than
>Expected Count. What I don't see is why it is important to measure the
>Expected Count. Another question is the relationship between the
>Expected Count and Count: does the difference have to be small or big
>for a term to be interesting?
>As for the Size field, it is the number of genes that could have been
>found in the interesting gene list if every instance were present.
>Present where?
>
>Thanks again, and I apologize for these questions, but it is quite
>difficult for me to understand the meaning of these fields from
>looking at the code.
>Javier
>

Jenny Drnevich, Ph.D.

Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign

330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA

ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at illinois.edu