[BioC] testing GO categories with Fisher's exact test.

Tue Feb 24 12:38:16 MET 2004

Hello,

Here are some thoughts on the GO testing ...

I don't know how the GO module in BioC works, but I guess you're looking at
just one GO class, aren't you? Note that one should always treat the three
categories independently (although they're clearly not independent ...).

I think the Fisher test is OK for testing independence of the variables, e.g.
apoptosis in the gene set versus apoptosis on the whole chip. The fact that
only 50 to 60% of a chip can be annotated via a GO class should not matter.
You're assuming that these 50% of the genes are representative of the entire
population of genes. This kind of assumption is common in many areas of
science, I guess.
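
To make the test concrete, here is a minimal Python sketch of the one-sided
(over-representation) Fisher test written as a hypergeometric tail sum. All
counts below are made up for illustration, and the function names are my own,
not part of any BioC package. Only annotated genes are counted, in line with
the assumption above.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k term-carrying genes when drawing n genes from N,
    of which K carry the term (sampling without replacement)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def over_representation_p(k, N, K, n):
    """One-sided Fisher p-value P(X >= k) for over-representation."""
    upper = min(K, n)  # cannot observe more hits than K or n
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, upper + 1))

# Hypothetical counts (assumptions, not real data):
N = 6000  # annotated genes on the chip (the population)
K = 120   # of those, annotated with "apoptosis"
n = 600   # annotated genes in the gene set (the sample)
k = 30    # of those, annotated with "apoptosis"

expected = n * K / N  # hits expected by chance
p_value = over_representation_p(k, N, K, n)
print(f"expected {expected:.1f} hits, observed {k}, p = {p_value:.3g}")
```

The same numbers could be arranged as a 2x2 table and passed to fisher.test
with alternative = "greater"; the sketch only spells out what is summed.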

Anyway, even if *all* un-annotated genes fall into completely new (as yet
unknown) categories, this doesn't really matter: the fact that, say,
"apoptosis" is strongly over-represented in your dataset would not be
influenced. However, if *disproportionately* many un-annotated genes are in
fact involved in apoptosis, then your assumption is wrong and you've got a
problem.

So, I don't worry too much about the fact that only a portion of my genes are
annotated with GO - otherwise I couldn't do any kind of analysis :-(

I was wondering about the multiple annotation issue, too. One gene is often
annotated several times in a GO class, but I think this is not a problem, and
it makes sense "biologically".

Especially for molecular_function you can have a range of functions for one
protein, at least one per domain. The biochemical functions of the domains
may be independent. Especially when annotating proteins via InterPro you may
end up with a per-domain GO annotation for your protein.

For cellular_component this may be different. Biological_process is more
tricky, I think. You might expect a protein to be involved in just one
biological process. However, this is not true. Take Glycogen Synthase Kinase 3
beta: it is a protein involved in a signalling pathway controlling cell
proliferation and in glycogen synthesis. Basically these are two independent
pathways (actually I think this is not really reflected in GO biol. proc. for
this protein :-( ).

Following the above example, one could test the number of annotations rather
than genes. If a gene is annotated with 2 independent GO terms within e.g.
biol. proc. (i.e. terms that do not belong to the same branch), then you'd
count the gene twice, once for each of the 2 functions!

I guess you're doing this implicitly anyway: one gene can increase the
"counter" (the frequency of observation) for several terms. This is fine as
long as you do this for both the population (the chip) and the sample (your
data set).
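
The counting described above can be sketched in a few lines of Python. The
gene names, term names and the term_counts helper are hypothetical, chosen
only to show that the same counting rule is applied to the chip and to the
sample, so that a gene with several terms increments several counters in
both.

```python
from collections import Counter

# Hypothetical gene -> GO term annotation for the whole chip.
chip_annotation = {
    "geneA": {"apoptosis", "cell proliferation"},
    "geneB": {"apoptosis"},
    "geneC": {"glycogen biosynthesis", "cell proliferation"},
    "geneD": {"glycogen biosynthesis"},
    "geneE": set(),  # un-annotated gene: contributes to no counter
}

# Hypothetical sample, e.g. the differentially expressed genes.
sample_genes = ["geneA", "geneC"]

def term_counts(genes, annotation):
    """Count, for each term, how many of the given genes carry it.
    A gene with several terms increments several counters."""
    counts = Counter()
    for gene in genes:
        counts.update(annotation.get(gene, ()))
    return counts

# Same rule for the population (chip) and the sample (data set).
chip_counts = term_counts(chip_annotation, chip_annotation)
sample_counts = term_counts(sample_genes, chip_annotation)

for term in sorted(chip_counts):
    print(term, sample_counts[term], "of", chip_counts[term])
```

The per-term pairs (count in sample, count on chip) are what would then go
into one Fisher test per term.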

I'd like to extend the GO testing issue a bit further. Most people look only
at over-represented terms (over-represented in the dataset with respect to
what's expected by chance when sampling from the chip). What about
under-representation? The biological interpretation of over-representation is
easy (this biological process is strongly affected ...), but how do you
interpret a significant under-representation?

Following on from that, it's also important to consider the correct tail of
the distribution, i.e. to set the "alternative" argument of fisher.test to
"two.sided", "less" or "greater" ...

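As a rough illustration of how the three tails differ, here is a small
Python sketch built on the hypergeometric distribution. The function names
and the toy counts are my own assumptions. For the two-sided case it sums the
probabilities of all outcomes that are no more likely than the observed one,
which is one common convention; other definitions of a two-sided p-value
exist.

```python
from math import comb

def pmf(k, N, K, n):
    """Hypergeometric P(X = k): n genes drawn from N, K carry the term."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def fisher_p(k, N, K, n, alternative="two.sided"):
    """p-value for observing k term-carrying genes in the sample."""
    lo = max(0, n - (N - K))  # smallest possible number of hits
    hi = min(K, n)            # largest possible number of hits
    if not lo <= k <= hi:
        raise ValueError("k is outside the possible range")
    probs = {i: pmf(i, N, K, n) for i in range(lo, hi + 1)}
    if alternative == "greater":  # over-representation
        return sum(p for i, p in probs.items() if i >= k)
    if alternative == "less":     # under-representation
        return sum(p for i, p in probs.items() if i <= k)
    if alternative == "two.sided":
        # Small tolerance so that ties with the observed table count.
        cutoff = probs[k] * (1 + 1e-7)
        return min(1.0, sum(p for p in probs.values() if p <= cutoff))
    raise ValueError("unknown alternative: " + alternative)

# Toy counts (assumptions): 10 genes, 5 with the term, sample of 5, 4 hits.
for alt in ("greater", "less", "two.sided"):
    print(alt, round(fisher_p(4, 10, 5, 5, alt), 4))
```

With these toy counts the "greater" tail is small-ish while the "less" tail
is close to 1, so picking the wrong tail would hide an over-representation
completely.
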
	kind regards,

	Arne

--
Arne Muller, Ph.D.
Aventis Pharma, Drug Safety Evaluation

> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch
> [mailto:bioconductor-bounces at stat.math.ethz.ch]On Behalf Of Nicholas
> Lewin-Koh
> Sent: 24 February 2004 09:33
> To: bioconductor at stat.math.ethz.ch
> Cc: rdiaz at cnio.es
> Subject: [BioC] testing GO categories with Fisher's exact test.
> 
> 
> Hi all,
> I have a few questions about testing for over-representation of terms in
> a cluster.
> Let's consider a simple case: a set of chips from an experiment, say
> treated and untreated, with 10,000 genes on the chip and 1000
> differentially expressed. Of the 10,000, 7000 can be annotated and 6000
> have a GO function assigned to them at a suitable level. Say for this
> example there are 30 GO classes that appear.
> I then conduct Fisher's exact test 30 times, once for each GO category, to
> detect differential representation of terms in the expressed set, and
> correct for multiple testing.
> 
> My question is on the validity of this procedure. Just from experience,
> many genes will have multiple functions assigned to them, so the genes
> falling into GO classes are not independent.
> Also, there is the large set of un-annotated genes, so we are in effect
> ignoring the influence of all the un-annotated genes on the outcome. Do
> people have any thoughts or opinions on these approaches? It is appearing
> all over the place in bioinformatics tools like FATIGO, EASE, DAVID etc.
> I find that the formal testing approach makes me very uncomfortable,
> especially as the biologists I work with tend to over-interpret the
> results.
> I am very interested to see the discussion on this topic.
> 
> Nicholas