[BioC] testing GO categories with Fisher's exact test.
nikko at hailmail.net
Wed Feb 25 11:58:35 MET 2004
I am crafting a longer reply which I will send later. But, in relation to
the chi-squared test: yes,
you could do a chi-squared test, but as Ramon points out, the reason to
use Fisher's exact test
is exactly that for small samples the test performs better. Though I have
found it very unsatisfying that Fisher's tea lady needed 4 out of 4 correct
for the result
to be significant :) .
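
To make the tea lady remark concrete, here is a minimal sketch. It assumes the classic design of 8 cups, 4 poured milk-first, with the lady told to pick out the 4 milk-first cups. The function name tea_p_value is made up for illustration.

```python
from math import comb

def tea_p_value(correct, cups_per_kind=4):
    """One-sided p-value: chance of picking at least `correct` of the
    milk-first cups by pure guessing (hypergeometric upper tail)."""
    n = cups_per_kind
    total = comb(2 * n, n)  # ways to choose which n cups are "milk first"
    favourable = sum(comb(n, k) * comb(n, n - k) for k in range(correct, n + 1))
    return favourable / total

p_all = tea_p_value(4)    # exactly 1/70
p_three = tea_p_value(3)  # exactly 17/70

print(f"4 of 4 correct: p = {p_all:.4f}")    # 0.0143
print(f"3 of 4 correct: p = {p_three:.4f}")  # 0.2429
```

With only 70 possible selections, 3 out of 4 correct cannot get anywhere near 0.05, which is why nothing short of a perfect score counts.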
In regard to point 2: in theory, if you fit the complete model, you would
have 2^20 cells (k = 20) for a k-way
table. However in this problem you could collapse a lot of the
many of the cells would have 0's, and with 0 information inference is
Also, one could make some fairly loose assumptions on the correlation
structure and probably reduce
the dimensionality further.
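
A rough sketch of why most cells must be empty: each gene occupies exactly one cell of the k-way table, so the number of non-empty cells can never exceed the number of genes. The simulation below is only an illustration. The 10% annotation rate (P_TERM) is invented, and the terms are drawn independently, which real GO annotations are not.

```python
import random

K = 20           # number of GO terms (the K = 20 example from this thread)
N_GENES = 6000   # annotated genes, as in the original question
P_TERM = 0.1     # ASSUMPTION: chance that a gene carries any given term

random.seed(0)

# Each gene is a tuple of K presence/absence flags, i.e. one cell of the table.
genes = [tuple(random.random() < P_TERM for _ in range(K)) for _ in range(N_GENES)]

total_cells = 2 ** K
occupied_cells = len(set(genes))

print(f"cells in the full table: {total_cells}")
print(f"occupied cells:          {occupied_cells}")
print(f"empty cells:             {total_cells - occupied_cells}")
```

Whatever the annotation pattern, at most 6000 of the 1,048,576 cells can hold a gene, so well over 99% of the table is zeros.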
On Wed, 25 Feb 2004 11:11:28 +0100, "Ramon Diaz-Uriarte" <rdiaz at cnio.es>
> Dear Michael,
> On Wednesday 25 February 2004 10:07, michael watson (IAH-C) wrote:
> > Forgive my naivety, but could one not use a chi-squared test here?
> > We have an observed amount of genes in each category, and could calculate
> > an expected from the size of the cluster and the distribution of all genes
> > throughout GO categories...
> I am not sure I see what your question is so I'll answer to two different
> questions, and hope one of them does it:
> 1. Chi-square vs. Fisher's exact test: Yes, sure. That is, in fact, what some
> tools do. Others directly use Fisher's exact test for contingency tables
> because/if/when some of the usual assumptions of the chi-square test do not
> hold (e.g., many of the cells have expected counts < 5; actual
> vary; Agresti [Categorical Data Analysis], Conover [Practical Nonparametric
> Statistics], etc., provide more details on the "rules of thumb" for when
> to trust the chi-square approximation). Fisher's exact test is the
> small sample test for independence in contingency tables.
> 2. Can you do a single test with all GO terms at the same time? I guess you
> could, but as I said in the other email the problem is the number of cells if the
> number of GO terms is even modest: you have a bunch of genes (e.g.,
> and each is cross-classified according to presence/absence of each of the
> GO terms. (So "We have an observed amount of genes in each category" is
> actually an accurate description of the sampling scheme here). That gives
> a K-way contingency table; for any modest K, the number of cells blows up
> (e.g., K = 20, you have 2^20 cells). And then, you need to be careful with
> the loglinear model in terms of what hypothesis you want to test, and which
> ones you are actually testing.
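
As an illustration of point 1, the sketch below takes one 2x2 table (cluster membership by GO term membership), computes the expected counts under independence, checks the "expected count < 5" rule of thumb, and puts the chi-square approximation next to the exact hypergeometric tail. The counts are invented and the helper names are hypothetical. Only the Python standard library is used. The chi-square p-value is the usual two-sided one on 1 degree of freedom without continuity correction, while the exact p-value here is one-sided (over-representation), so the two are not directly interchangeable.

```python
from math import comb, erfc, sqrt

def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return [[r * c / n for c in cols] for r in rows]

def chi_square_stat(table):
    """Pearson chi-square statistic, no continuity correction."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))

def fisher_upper_tail(table):
    """One-sided Fisher p-value: P(top-left cell >= observed), margins fixed."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, col1 = a + b, a + c
    ways = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1))
    return ways / comb(n, row1)

# Hypothetical counts. Rows: in cluster / not in cluster.
# Columns: annotated with the GO term / not annotated with it.
table = [[5, 15],
         [15, 165]]

expected = expected_counts(table)
smallest = min(min(row) for row in expected)
stat = chi_square_stat(table)
p_chi2 = erfc(sqrt(stat / 2))   # chi-square survival function for 1 df
p_fisher = fisher_upper_tail(table)

print(f"smallest expected count: {smallest}")
print(f"chi-square statistic: {stat:.3f}, approximate p = {p_chi2:.4f}")
print(f"Fisher exact, one-sided p = {p_fisher:.4f}")
```

Here the smallest expected count is 2, below the usual threshold of 5, so this is the kind of table where the exact test is the safer choice.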
> > ?
> > -----Original Message-----
> > From: Nicholas Lewin-Koh [mailto:nikko at hailmail.net]
> > Sent: 24 February 2004 08:33
> > To: bioconductor at stat.math.ethz.ch
> > Cc: rdiaz at cnio.es
> > Subject: [BioC] testing GO categories with Fisher's exact test.
> > Hi all,
> > I have a few questions about testing for over-representation of terms in
> > a cluster.
> > Let's consider a simple case: a set of chips from an experiment, say
> > treated and untreated, with 10,000
> > genes on the chip and 1000 differentially expressed. Of the 10,000, 7000
> > can be annotated and 6000 have
> > a GO function assigned to them at a suitable level. Say for this example
> > there are 30 GO classes that appear.
> > I then conduct Fisher's exact test 30 times, once on each GO category, to detect
> > differential representation of terms in the expressed
> > set, and correct for multiple testing.
> > My question is on the validity of this procedure. Just from experience
> > many genes will
> > have multiple functions assigned to them so the genes falling into GO
> > classes are not independent.
> > Also, there is the large set of unannotated genes, so we are in effect
> > ignoring the influence of
> > all the unannotated genes on the outcome. Do people have any thoughts or
> > opinions on these approaches? They are
> > appearing all over the place in bioinformatics tools like FATIGO, EASE,
> > DAVID etc. I find that
> > the formal testing approach makes me very uncomfortable, especially as
> > the biologists I work with tend to over-interpret the results.
> > I am very interested to see the discussion on this topic.
> > Nicholas
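
For reference, the procedure described in the original question can be sketched roughly as follows: one one-sided Fisher (hypergeometric) test per GO category, followed by a Bonferroni correction over the 30 tests. The universe of 6000 annotated genes and the 30 tests come from the post. The number of differentially expressed genes among those 6000 (SELECTED) and the three category counts are invented purely for illustration, and this is not the implementation used by FATIGO, EASE or DAVID.

```python
from math import comb

UNIVERSE = 6000   # genes with a GO function assigned (from the post)
SELECTED = 600    # ASSUMPTION: differentially expressed genes among those 6000
N_TESTS = 30      # GO classes tested (from the post)

def over_representation_p(in_term, in_term_and_selected):
    """Upper-tail hypergeometric p-value (one-sided Fisher's exact test)."""
    upper = min(in_term, SELECTED)
    ways = sum(comb(in_term, k) * comb(UNIVERSE - in_term, SELECTED - k)
               for k in range(in_term_and_selected, upper + 1))
    return ways / comb(UNIVERSE, SELECTED)

# Hypothetical categories: name -> (genes in term, of which selected).
categories = {"term_A": (200, 40), "term_B": (150, 16), "term_C": (50, 12)}

results = {}
for name, (size, hits) in categories.items():
    p = over_representation_p(size, hits)
    p_bonf = min(1.0, p * N_TESTS)   # Bonferroni-adjusted p-value
    results[name] = (p, p_bonf)
    print(f"{name}: raw p = {p:.3g}, Bonferroni p = {p_bonf:.3g}")
```

Bonferroni remains valid when the tests are dependent, although it is conservative. It does nothing about the other concern raised above: the unannotated genes never enter UNIVERSE at all.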
> Ramón Díaz-Uriarte
> Bioinformatics Unit
> Centro Nacional de Investigaciones Oncológicas (CNIO)
> (Spanish National Cancer Center)
> Melchor Fernández Almagro, 3
> 28029 Madrid (Spain)
> Fax: +-34-91-224-6972
> Phone: +-34-91-224-6900
> PGP KeyID: 0xE89B3462