[BioC] Re: testing GO categories with Fisher's exact test.

Ramon Diaz-Uriarte rdiaz at cnio.es
Tue Feb 24 16:42:02 MET 2004

Dear Nicholas,

On Tuesday 24 February 2004 09:33, Nicholas Lewin-Koh wrote:
> Hi all,
> I have a few questions about testing for over representation of terms in
> a cluster.
> Let's consider a simple case: a set of chips from an experiment, say
> treated and untreated, with 10,000
> genes on the chip and 1000 differentially expressed. Of the 10000, 7000
> can be annotated and 6000 have
> a GO function assigned to them at a suitable level. Say for this example
> there are 30 GO classes that appear.
> I then conduct Fisher's exact test 30 times on each GO category to detect
> differential representation of terms in the expressed
> set and correct for multiple testing.

I think I understand your setup. Just to double check, let me rephrase as:
- for every one of the 30 GO terms, you set up a 2x2 contingency table (genes 
with/without the GO term by genes in class A vs. genes in class B), and carry 
out a Fisher's exact test, so you do 30 tests.
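
A minimal sketch of that procedure, in plain Python so it needs no extra
packages. The assumptions are mine, not Nicholas's: the test is one-sided
(over-representation only), the per-term counts are made up for illustration,
and the multiple-testing correction is plain Bonferroni over the 30 tests.

```python
from math import comb


def fisher_over_representation(k, n_selected, n_with_term, n_total):
    """One-sided Fisher's exact test p-value, P(X >= k), under the
    hypergeometric null of no association.

    k            genes that are differentially expressed AND carry the term
    n_selected   differentially expressed genes
    n_with_term  genes in the table that carry the term
    n_total      all genes in the 2x2 table
    """
    upper = min(n_selected, n_with_term)
    denom = comb(n_total, n_selected)
    tail = sum(
        comb(n_with_term, i) * comb(n_total - n_with_term, n_selected - i)
        for i in range(k, upper + 1)
    )
    # Exact integer arithmetic up to this final division.
    return tail / denom


# Hypothetical counts (NOT from the thread): term -> (k, genes with the term).
terms = {"term_A": (40, 200), "term_B": (22, 200), "term_C": (5, 50)}
N_TOTAL, N_SELECTED, N_TESTS = 10000, 1000, 30

p_values = {
    term: fisher_over_representation(k, N_SELECTED, m, N_TOTAL)
    for term, (k, m) in terms.items()
}
# Bonferroni adjustment for the 30 tests (an assumed choice of correction).
adjusted = {term: min(1.0, p * N_TESTS) for term, p in p_values.items()}

for term in terms:
    print(term, p_values[term], adjusted[term])
```

A two-sided version, or a different correction (Holm, FDR), would change the
last few lines but not the structure: one 2x2 table per GO term, 30 tests.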

However, I am not sure what you mean by "7000 can be annotated and 6000 have
a GO function assigned to them at a suitable level". Does this mean that, if a 
gene has no GO annotation, you will not introduce it into the above 2x2 
tables? It could still be in the tables (so the entries in each 2x2 table sum 
to 10000); it just goes to the "absent" cells.
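
To make the two options concrete, here is a small sketch that builds the 2x2
table for one GO term under both choices of gene universe. All the counts
except the 10000 and 6000 totals and the 1000 differentially expressed genes
are hypothetical (in particular, I assume 700 of the 1000 differentially
expressed genes are among the 6000 with a GO term).

```python
def two_by_two(k, n_selected, n_with_term, n_total):
    """2x2 table for one GO term.

    Rows: term present / term absent.
    Columns: differentially expressed / not differentially expressed.
    """
    return [
        [k, n_with_term - k],
        [n_selected - k, n_total - n_with_term - n_selected + k],
    ]


def expected_overlap(n_selected, n_with_term, n_total):
    """Expected count in the top-left cell under independence."""
    return n_selected * n_with_term / n_total


# Hypothetical term: carried by 300 genes, 60 of them differentially expressed.
K, N_WITH_TERM = 60, 300

# Option 1: all 10000 genes; unannotated genes fall in the "absent" row.
full = two_by_two(K, 1000, N_WITH_TERM, 10000)

# Option 2: only the 6000 genes with a GO term; 700 differentially expressed
# genes among them is an assumed number.
annotated_only = two_by_two(K, 700, N_WITH_TERM, 6000)

print("all genes:     ", full, expected_overlap(1000, N_WITH_TERM, 10000))
print("annotated only:", annotated_only, expected_overlap(700, N_WITH_TERM, 6000))
```

The "present" row is identical in both tables; only the "absent" row changes,
and with it the expected overlap against which the observed 60 is compared.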

> My question is on the validity of this procedure. Just from experience
> many genes will
> have multiple functions assigned to them so the genes falling into GO
> classes are not independent.

Yes, sure, though I'd rather reword it as saying that the presence/absence of 
a GO term X (e.g., metabolism) is not independent of the presence/absence of 
GO term Y (e.g., transport). 

However, I don't see this as an inherent problem. Suppose you measure arm 
length, body mass, and height of a bunch of men and women, and carry out 
three t-tests. Of course, the three variables are correlated. 
Now, you might have used Hotelling's T^2 test for testing the null hypothesis 
that the multivariate means (in the space defined by the three traits) of the 
two sexes do not differ. But that is a different biological question from 
asking "do they differ in any one of the three traits", which is what you 
would be asking if you ran 3 t-tests. [Some of these issues are discussed very 
nicely by W. Krzanowski in "Principles of multivariate analysis", pp. 235-251 
of the 1988 edition, and in the categorical variable case by Fienberg, "The 
analysis of cross-classified categorical data, 2nd ed", pp. 20-21]. 

From the above point of view, I think that many of the examples in Westfall & 
Young ("Resampling-based multiple testing") could also be reframed in a 
multivariate way. But they are not. The reason, I think, is that in most of 
these cases (e.g., FatiGO, Westfall & Young, etc.) the biologists are 
interested in fishing in a sea of univariate hypotheses: the questions they 
are asking are usually univariate ones.

A multivariate alternative would be to use a log-linear model of a 31-way 
contingency table: we have 10000 genes that we cross-classify according to 
group membership (differentially expressed or not), and each of the K = 30 GO 
terms (with two values for each term: present or absent). So we have a 
multidimensional table of 2 x 2^30 cells, i.e., about 2.1 billion cells to be 
filled with only 10000 genes. This won't work.
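
The arithmetic behind "this won't work", as a quick sketch (the variable
names are mine):

```python
n_genes = 10_000   # genes on the chip
n_terms = 30       # GO terms, each present or absent

# One binary factor for group membership plus one per GO term: 2 x 2^30 cells.
n_cells = 2 * 2 ** n_terms

# Even if every gene landed in a different cell, this is the largest
# fraction of cells that could be non-empty.
max_fraction_occupied = n_genes / n_cells

print(n_cells, max_fraction_occupied)
```

So virtually all cells of the table are necessarily empty, whatever the data.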

> Also, there is the large set of un-annotated genes so we are in effect
> ignoring the influence of
> all the unannotated genes on the outcome.

This relates to the more general problem of the quality of GO annotations, 
with two related problems:

a) absence of annotation does not necessarily mean absence of that GO 
function, but maybe just that that particular aspect has never been studied 
for that gene;

b) presence of an annotation does not mean that the gene really has that 
function, since there are mistakes in the annotation; in fact, GO has a bunch 
of levels for "quality of annotations" (see

It is my understanding that most tools, right now, just ignore these issues. I 
am not sure how serious the consequences are, but so far at least our 
experience seems to be that results make sense (e.g., see our examples in 




Of course, this is no excuse. A possible way would be to explicitly model what 
presence and absence of annotation mean, probably making use of the 
information contained in the "quality of annotations", within a Bayesian 
framework.  M. Battacharjee and I have been working on it (but, because of my 
delays, this is becoming a never-ending project).

> opinions on these approaches? It is
> appearing all over the place in bioinformatics tools like FATIGO, EASE,
> DAVID etc. I find that

Yes, several people have had similar ideas. And I think there are a few other 
similar tools around.

> the formal testing approach makes me very uncomfortable, especially as
> the biologists I work with tend to over interpret the results.

I don't see your last point: how the formal testing leads to 



Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900

PGP KeyID: 0xE89B3462
