[BioC] testing GO categories with Fisher's exact test.

Wed Feb 25 14:50:56 MET 2004

I should add to this thread that there is existing software that will do
resampling to assess global significance of the p-values obtained from
this sort of analysis.

http://dot.ped.med.umich.edu:2000/pub/sig_terms/index.htm

Best,

Jim

James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623

>>> <cberry at tajo.ucsd.edu> 02/24/04 02:23PM >>>
On Tue, 24 Feb 2004, Nicholas Lewin-Koh wrote:

> Hi all,
> I have a few questions about testing for over-representation of terms in
> a cluster.
> Let's consider a simple case: a set of chips from an experiment, say
> treated and untreated, with 10,000 genes on the chip and 1000
> differentially expressed. Of the 10,000, 7000 can be annotated and 6000
> have a GO function assigned to them at a suitable level. Say for this
> example there are 30 GO classes that appear.
> I then conduct Fisher's exact test 30 times, once for each GO category,
> to detect differential representation of terms in the expressed set, and
> correct for multiple testing.
> 
> My question is on the validity of this procedure. 

It depends on what hypotheses you wish to test. The uniform distribution
of the p-value under the null hypothesis depends on ***all*** the
assumptions of the test holding.
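
As a minimal sketch of the procedure under discussion (not code from any of the tools named in this thread): the one-sided Fisher's exact test for over-representation is the upper tail of a hypergeometric distribution, computed once per GO category and then adjusted for multiple testing. The function names, the choice of a Bonferroni adjustment, and the counts used for the example category are illustrative assumptions. Only the 6000 annotated genes and the 30 categories come from the example quoted above.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric: N genes in total, K of them in the
    category, n genes selected, k of the selected genes in the category.
    This equals the one-sided Fisher's exact test p-value for enrichment."""
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible tables contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


def bonferroni(pvalues):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Hypothetical counts for one category: 6000 annotated genes, 600 of them
# called differentially expressed, 200 genes in the category, 35 of which
# are in the differentially expressed set.
p_one_category = hypergeom_upper_tail(35, 200, 600, 6000)

# With 30 categories, this p-value would be adjusted alongside the other 29.
adjusted = bonferroni([p_one_category] + [1.0] * 29)[0]
print(p_one_category, adjusted)
```

Note that this p-value treats the genes as independent draws, which is exactly the assumption in question.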

The trouble is that you probably do not want to test whether the genes on
your microarray are independent, since you already know that they are not:

> Just from experience many genes will have multiple functions assigned
> to them, so the genes falling into GO classes are not independent.

> Also, there is the large set of unannotated genes, so we are in effect
> ignoring the influence of all the unannotated genes on the outcome. Do
> people have any thoughts or opinions on these approaches? They are
> appearing all over the place in bioinformatics tools like FATIGO, EASE,
> DAVID etc.

SAM and similar permutation-based approaches can be implemented for this
setup to get p-values (or FDRs) that do not depend on independence of
genes/transcripts.

In several data sets I have seen, the results given by permutation (of
sample identities, using the hypergeometric p-value as the test statistic)
are several orders of magnitude more conservative than the original
'p-value', even before correcting for multiple comparisons.
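
A rough sketch of that kind of permutation scheme, for one category, might look like the following. It is not SAM and not the code behind the results mentioned here. The gene selection rule (top genes by absolute difference in group means), the helper names, and the toy data are all assumptions made for illustration. The key point is that the sample labels are shuffled and the whole analysis, including gene selection, is redone each time, so the correlation between genes is preserved in the null distribution.

```python
import random
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw (one-sided Fisher's exact test)."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


def top_genes(expr, labels, n_top):
    """Indices of the n_top genes with the largest absolute difference in
    group means (an assumed, deliberately simple selection rule)."""
    g1 = [j for j, lab in enumerate(labels) if lab == 1]
    g0 = [j for j, lab in enumerate(labels) if lab == 0]

    def score(row):
        mean1 = sum(row[j] for j in g1) / len(g1)
        mean0 = sum(row[j] for j in g0) / len(g0)
        return abs(mean1 - mean0)

    ranked = sorted(range(len(expr)), key=lambda g: score(expr[g]), reverse=True)
    return set(ranked[:n_top])


def category_stat(expr, labels, category, n_top):
    """Hypergeometric p-value for one category, used only as a statistic."""
    k = len(top_genes(expr, labels, n_top) & category)
    return hypergeom_upper_tail(k, len(category), n_top, len(expr))


def permutation_pvalue(expr, labels, category, n_top, n_perm=200, seed=0):
    """Shuffle sample labels, redo the selection and the hypergeometric
    calculation, and count how often the permuted statistic is at least as
    extreme (as small) as the observed one."""
    rng = random.Random(seed)
    observed = category_stat(expr, labels, category, n_top)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if category_stat(expr, shuffled, category, n_top) <= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the arrangements.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 40 genes by 8 samples of pure noise, 4 treated and 4 untreated.
toy_rng = random.Random(42)
toy_expr = [[toy_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(40)]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_category = set(range(10))  # hypothetical category: the first 10 genes
nominal_p, perm_p = permutation_pvalue(toy_expr, toy_labels, toy_category, n_top=8)
print(nominal_p, perm_p)
```

With many categories, one would presumably also take the minimum statistic over categories within each permutation to get a global assessment, but that is left out of the sketch.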

I recall someone from the MAPPfinder group remarking at a conference last
July that MAPPfinder 2.0 would implement permutation methods. But I cannot
find this release yet using Google.

Another approach to permutation testing of expression vs ontology is
outlined in:

Mootha VK et al. PGC-1alpha-responsive genes involved in
oxidative phosphorylation are coordinately downregulated in human
diabetes. Nature Genetics, 34(3):267-73, 2003.

> I find that the formal testing approach makes me very uncomfortable,
> especially as the biologists I work with tend to over-interpret the
> results.

Testing a better-focussed hypothesis should increase your comfort level.

:-)

> I am very interested to see the discussion on this topic.
> 
> Nicholas
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch 
> https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor 
> 

Charles C. Berry                         (858) 534-2098
                                         Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu         UC San Diego
http://hacuna.ucsd.edu/members/ccb.html  La Jolla, San Diego 92093-0717
