[BioC] Re: testing GO categories with Fisher's exact test.

Tue Feb 24 17:21:18 MET 2004

Hello Nicholas!

This is a follow-up to Ramon's comments and suggestions to you. I have
done some work on functional enrichment assessment. In a
Bayesian framework it is actually possible to address some of the
questions you have raised, e.g. multiple hypothesis testing, missing
annotation, and the quality of the available annotation.

In a Bayesian framework inference is drawn using the joint distribution of
all the attributes, which are in this case GO annotations and
expression measurements for a gene. If there is dependence either among
functional categories or among genes, it is taken care of by this joint
distribution (of course only to the extent the model permits).

I have done some simple modelling accounting for annotation error too.
The models can take into account both missing annotation as well as
possibly erroneous annotation. But so far I have worked with very
simplistic models, and the real problem requires a much more in-depth
analysis of the annotation information, on which we (i.e. Ramon and I) have
made some progress, though not yet enough to be satisfactory.

Another problem is the successful implementation of such huge Bayesian
models. Computation can be, and often is, quite difficult, which is one of
the reasons why classical univariate hypothesis testing techniques are so
popular, although obviously they are only approximate methods.

Some of us at Helsinki are trying to address the computation question too.
But I guess it will still be some time before we are ready to handle the
kinds of unstructured dependence that we see in microarray
data or functional data.

Wishes,

Madhu

On Tue, 24 Feb 2004, Ramon Diaz-Uriarte wrote:

> Dear Nicholas,
>
> On Tuesday 24 February 2004 09:33, Nicholas Lewin-Koh wrote:
> > Hi all,
> > I have a few questions about testing for over-representation of terms in
> > a cluster.
> > Let's consider a simple case: a set of chips from an experiment, say
> > treated and untreated, with 10,000
> > genes on the chip and 1000 differentially expressed. Of the 10000, 7000
> > can be annotated and 6000 have
> > a GO function assigned to them at a suitable level. Say for this example
> > there are 30 GO classes that appear.
> > I then conduct Fisher's exact test 30 times, once for each GO category,
> > to detect differential representation of terms in the expressed
> > set and correct for multiple testing.
>
> I think I understand your setup. Just to double check, let me rephrase as:
> - for every one of the 30 GO terms, you set up a 2x2 contingency table (genes
> with/without the GO term by genes in class A vs. genes in class B), and carry
> out a Fisher's exact test, so you do 30 tests.
>
> However, I am not sure what you mean by "7000 can be annotated and 6000 have
> a GO function assigned to them at a suitable level". Does this mean that, if a
> gene has no GO annotation you will not introduce it into the above 2x2
> tables? It could be in the table (so the sum of entries in each of the 2x2
> tables is 10000); it just goes to the "absent" cells.
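
To make the setup above concrete, here is a minimal sketch in Python
(standard library only) of a one-sided Fisher's exact test for a single GO
term, followed by a Holm step-down adjustment across terms. The thread does
not say which multiple-testing correction is used, so Holm is only an
assumption, and the function names and all counts are hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for over-representation in a 2x2 table.

    Rows: GO term present / absent. Columns: differentially expressed (DE) / not.
        a = term present, DE         b = term present, not DE
        c = term absent,  DE         d = term absent,  not DE
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    n_term = a + b            # genes carrying the term
    n_de = a + c              # differentially expressed genes
    n_total = a + b + c + d   # all genes entered in the table
    upper = min(n_term, n_de)
    tail = sum(comb(n_term, k) * comb(n_total - n_term, n_de - k)
               for k in range(a, upper + 1))
    return tail / comb(n_total, n_de)


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        value = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, value)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


# Hypothetical counts for one GO term (not taken from any real data set).
# All 10000 genes, with unannotated genes counted in the "absent" row:
p_all = fisher_one_sided(60, 240, 940, 8760)
# Only the 6000 genes with a usable GO annotation, assuming 600 of them are DE:
p_annotated = fisher_one_sided(60, 240, 540, 5160)
print(p_all, p_annotated)

# Holm adjustment over a few hypothetical raw p-values (one per GO term).
print(holm_adjust([0.001, 0.02, 0.04, 0.30]))
```

The two calls differ only in whether genes without annotation are kept in
the table, which is exactly the choice being asked about; the p-value for
the same term can change with that choice.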
>
> >
> > My question is on the validity of this procedure. Just from experience
> > many genes will
> > have multiple functions assigned to them so the genes falling into GO
> > classes are not independent.
>
> Yes, sure, though I'd rather reword it as saying that the presence/absence of
> a GO term X (e.g., metabolism) is not independent of the presence/absence of
> GO term Y (e.g., transport).
>
> However, I don't see this as an inherent problem. Suppose you measure arm
> length, body mass, and height, of a bunch of men and women, and carry out
> three t-tests. Of course, the three variables are correlated.
> Now, you might have used Hotelling's T^2 test for testing the null hypothesis
> that the multivariate means (in the space defined by the three traits) of the
> sexes do not differ. But that is a different biological question from asking
> "do they differ in any one of the three traits", which is what you would be
> asking if you run 3 t-tests. [Some of these issues are discussed very nicely
> by W. Krzanowski in "Principles of multivariate analysis", pp. 235-251 in
> the 1988 edition, and in the categorical variable case by Fienberg, "The
> analysis of cross-classified categorical data, 2nd ed", pp. 20-21].
>
> From the above point of view, I think that many of the examples in Westfall &
> Young ("Resampling-based multiple testing") could also be reframed in a
> multivariate way. But they are not. The reason, I think, is that in most of
> these cases (e.g., FatiGO, Westfall & Young, etc.) the biologists are
> interested in fishing in a sea of univariate hypotheses. I think that most of
> the questions that biologists are asking in these cases are univariate.
>
> A multivariate alternative would be to use a log-linear model of a 31-way
> contingency table: we have 10000 genes that we cross-classify according to
> group membership (differentially expressed or not), and each of the K = 30 GO
> terms (with two values for each term: present or absent). So we have a
> multidimensional table of 2 x 2^30 cells. This won't work: that is 2^31
> (about 2.1 billion) cells for only 10000 genes, so nearly every cell would
> be empty.
>
> > Also, there is the large set of un-annotated genes so we are in effect
> > ignoring the influence of
> > all the unannotated genes on the outcome.
>
>
>
> This relates to the more general problem of the quality of GO annotations,
> with two related problems:
>
> a) absence of annotation does not necessarily mean absence of that GO
> function, but maybe just that that particular aspect has never been studied
> for that gene;
>
> b) presence of an annotation does not mean that the gene really has that
> function, since there are mistakes in the annotation; in fact, GO has a bunch
> of levels for "quality of annotations" (see
> http://www.geneontology.org/GO.evidence.html
> ).
>
>
> It is my understanding that most tools, right now, just ignore these issues. I
> am not sure how serious the consequences are, but so far at least our
> experience seems to be that results make sense (e.g., see our examples in
>
> http://bioinfo.cnio.es/docus/papers/techreports.html#FatiGO-NNSP
>
> and
>
> http://bioinfo.cnio.es/docus/papers/techreports.html#camda-02).
>
> Of course, this is no excuse. A possible way would be to explicitly model what
> presence and absence of annotation mean, probably making use of the
> information contained in the "quality of annotations", within a Bayesian
> framework.  M. Battacharjee and I have been working on it (but, because of my
> delays, this is becoming a never-ending project).
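
As a toy illustration of what "explicitly modelling what presence and
absence of annotation mean" could look like, the sketch below applies
Bayes' rule to a single gene and a single GO term. It is not the model
referred to in this thread: the function name, the idea of summarising
annotation quality as a sensitivity and a false-positive rate, and all the
numbers are assumptions made purely for illustration.

```python
def posterior_has_function(prior, sensitivity, false_positive_rate, annotated):
    """P(gene truly has the function | annotation status), by Bayes' rule.

    prior               : assumed P(gene has the function) before seeing annotation
    sensitivity         : assumed P(annotated | gene has the function)
    false_positive_rate : assumed P(annotated | gene lacks the function)
    annotated           : True if the GO term is recorded for the gene
    """
    if annotated:
        like_true, like_false = sensitivity, false_positive_rate
    else:
        like_true, like_false = 1.0 - sensitivity, 1.0 - false_positive_rate
    numerator = like_true * prior
    return numerator / (numerator + like_false * (1.0 - prior))


# Hypothetical values: 10% of genes have the function, half of those are
# annotated with it, and 5% of genes lacking it are annotated by mistake.
print(posterior_has_function(0.10, 0.50, 0.05, annotated=True))
print(posterior_has_function(0.10, 0.50, 0.05, annotated=False))
```

Under these made-up numbers, an unannotated gene still has a non-zero
probability of having the function, and an annotated gene is not certain to
have it, which corresponds to points a) and b) above.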
>
>
> > opinions on these approaches? It is
> > appearing all over the place in bioinformatics tools like FATIGO, EASE,
> > DAVID etc. I find that
>
> Yes, several people have had similar ideas. And I think there are a few other
> similar tools around.
>
> > the formal testing approach makes me very uncomfortable, especially as
> > the biologists I work with tend to overinterpret the results.
>
>
> I don't see your last point: how the formal testing leads to
> overinterpretation.
>
>
> Best,
>
> Ramón
>
> --
> Ramón Díaz-Uriarte
> Bioinformatics Unit
> Centro Nacional de Investigaciones Oncológicas (CNIO)
> (Spanish National Cancer Center)
> Melchor Fernández Almagro, 3
> 28029 Madrid (Spain)
> Fax: +-34-91-224-6972
> Phone: +-34-91-224-6900
>
> http://bioinfo.cnio.es/~rdiaz
> PGP KeyID: 0xE89B3462
> (http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)
>
>
>
>