[BioC] understanding GOstats p-value

Sun Jan 6 00:34:12 CET 2008

Hi Janet,

Interpreting p-values from the hypergeometric test is not 
straightforward. One of the underlying assumptions of the hypergeometric 
test is that the individual things being chosen are independent (think 
balls in an urn). Unfortunately, this is not true of genes or GO terms.
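
To make the urn picture concrete, here is a minimal base R sketch of the 
plain (unconditional) hypergeometric calculation, using the numbers from 
the GO:0007156 row of the output quoted below. The variable names are 
mine, and because the conditional test can adjust the gene sets it works 
with, this is not necessarily what hyperGTest computes internally, so 
the p-value need not match the reported one exactly.

```r
# Numbers copied from the quoted summary(hgCondOver) output (GO:0007156).
universe  <- 12325  # gene universe size (all balls in the urn)
selected  <- 1433   # selected gene set size (balls drawn)
term_size <- 111    # universe genes annotated to the term ("Size")
count     <- 27     # selected genes annotated to the term ("Count")

# Expected number of annotated genes in a random draw of this size.
exp_count <- term_size * selected / universe

# Over-representation p-value P(X >= count). phyper() returns P(X <= q),
# so we pass count - 1 and ask for the upper tail.
p_over <- phyper(count - 1, term_size, universe - term_size, selected,
                 lower.tail = FALSE)

# Naive expected number of terms below the cutoff if all 2261 tests were
# independent with uniform p-values (an assumption that does not hold here).
naive_fp <- 2261 * 0.001

exp_count
p_over
naive_fp
```

The expected count agrees with the ExpCount column for that row, and the 
naive figure is the "about 2 false positives" from the original 
question.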

There are at least two types of dependence here. First, the expression 
of genes is not independent -- one gene can affect the expression of 
another. Second, the GO terms are set up as a directed acyclic graph, 
with child terms being subsets of the parent terms, so there is another 
level of dependence. You can use the conditional test to help limit this 
second level of dependence, but there isn't too much you can do about 
the first.

Because of this unknown dependence structure it is difficult to do any 
multiple testing correction for the hypergeometric test, even for a 
single comparison, not to mention multiple comparisons. One thing I have 
done in the past for a single comparison is a Monte Carlo resampling: 
randomly select n 'differentially expressed' genes (where n is the 
number of differentially expressed genes you actually observed) and 
see how many significant GO terms you get. Do this, say, 500 or 1000 
times, and you will know how many terms you expect to see by chance 
alone, which gives you an estimate of the number of false positives in 
your observed results. Unfortunately, this is very time-consuming, and 
I'm not sure whether it would scale to multiple comparisons.
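
The resampling idea can be sketched in base R roughly as follows. This 
is only an illustration under stated assumptions: the gene universe and 
term memberships are simulated (hypothetical data with no GO graph 
structure), the test is the plain hypergeometric via phyper() rather 
than the conditional test, and all names (count_significant, n_perm and 
so on) are made up for the example. With real data you would run the 
actual test on each random gene selection instead.

```r
set.seed(1)  # reproducible illustration

# Simulated stand-in for real annotation (hypothetical data).
n_universe <- 2000   # genes in the universe
n_terms    <- 200    # number of terms tested
n_selected <- 150    # n: number of observed 'differentially expressed' genes
cutoff     <- 0.001  # p-value cutoff for calling a term significant
n_perm     <- 200    # number of random draws (500 or 1000 in practice)

genes      <- seq_len(n_universe)
term_sizes <- sample(5:100, n_terms, replace = TRUE)
terms      <- lapply(term_sizes, function(s) sample(genes, s))

# Number of terms with an over-representation p-value below the cutoff
# for a given selection of genes.
count_significant <- function(selected) {
  hits <- vapply(terms, function(g) sum(g %in% selected), integer(1))
  p <- phyper(hits - 1, term_sizes, n_universe - term_sizes,
              length(selected), lower.tail = FALSE)
  sum(p < cutoff)
}

# Draw n random genes many times and record the number of significant terms.
null_counts <- replicate(n_perm,
                         count_significant(sample(genes, n_selected)))

# Estimated number of terms expected to be significant by chance alone.
mean(null_counts)
```

Comparing the number of significant terms in the observed result with 
mean(null_counts) gives the rough false positive estimate described 
above. The cost is one full set of tests per random draw, which is why 
this gets slow with the real test and a large number of terms.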

However, if only a small number of terms are significant, it 
shouldn't be too difficult to do downstream validation of that result.

Best,

Jim

Janet Young wrote:
> Hi,
> 
> I have a fairly naive question - I want to make sure I can more or  
> less understand the p-values that GOstats hyperGTest comes out with.   
> Am I right in thinking the p-value is for enrichment of each category  
> individually (i.e. NOT corrected for multiple testing)?
> 
> I'm analyzing array CGH data so I am testing a lot of categories (my  
> universe is all human genes that have a chromosome position, GO  
> category and entrez ID).  Below is an example result - my  
> interpretation is that I shouldn't get super-excited about finding 3  
> categories with p<0.001 if I've tested 2261 categories (would expect  
> about 2 false positives).   Have I understood that correctly?
> 
>  > hgCondOver
> Gene to GO BP Conditional test for over-representation
> 2261 GO BP ids tested (3 have p < 0.001)
> Selected gene set size: 1433
>      Gene universe size: 12325
>      Annotation package: org.Hs.eg.db
>  >  summary(hgCondOver)
>                 GOBPID       Pvalue OddsRatio  ExpCount Count Size
> GO:0007156 GO:0007156 0.0001330755  2.470839 12.905720    27  111
> GO:0001894 GO:0001894 0.0007587546  5.553301  2.209087     8   19
> GO:0007600 GO:0007600 0.0009353695  1.446591 74.062556   100  637
>                                 Term
> GO:0007156 homophilic cell adhesion
> GO:0001894       tissue homeostasis
> GO:0007600       sensory perception
> 
> thanks very much,
> 
> Janet Young
> 
> -------------------------------------------------------------------
> 
> Dr. Janet Young (Trask lab)
> 
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Avenue N., C3-168,
> P.O. Box 19024, Seattle, WA 98109-1024, USA.
> 
> tel: (206) 667 1471 fax: (206) 667 6524
> email: jayoung at fhcrc.org
> 
> http://www.fhcrc.org/labs/trask/
> 

-- 
James W. MacDonald, MS
Biostatistician
UMCCC cDNA and Affymetrix Core
University of Michigan
1500 E Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623