[BioC] testing GO categories with Fisher's exact test.
rdiaz at cnio.es
Wed Feb 25 11:11:28 MET 2004
On Wednesday 25 February 2004 10:07, michael watson (IAH-C) wrote:
> Forgive my naivety, but could one not use a chi-squared test here?
> We have an observed amount of genes in each category, and could calculate
> an expected from the size of the cluster and the distribution of all genes
> throughout GO categories...
I am not sure I see what your question is, so I'll answer two different
questions and hope one of them is the one you meant:
1. Chi-square vs. Fisher's exact test: Yes, sure. That is, in fact, what some
tools do. Others directly use Fisher's exact test for contingency tables
because/if/when some of the usual assumptions of the chi-square test do not
hold (e.g., many of the cells have expected counts < 5; actual recommendations
vary; Agresti [Categorical Data Analysis], Conover [Practical Nonparametric
Statistics], etc., provide more details on the "rules of thumb" for when not
to trust the chi-square approximation). Fisher's exact test is the classical
small sample test for independence in contingency tables.
2. Can you do a single test with all GO terms at the same time? I guess you
could, but as I said in the other email the problem is the number of cells,
even when the number of GO terms is modest: you have a bunch of genes (e.g., 10000)
and each is cross-classified according to presence/absence of each of the K
GO terms. (So "We have an observed amount of genes in each category" is not
actually an accurate description of the sampling scheme here.) That gives you
a K-way contingency table; for any modest K, the number of cells blows up
(e.g., with K = 20 you have 2^20 cells, about a million, far more than the
10000 genes available to fill them). And then, you need to be careful with
the loglinear model in terms of what hypothesis you want to test, and which
ones you are actually testing.
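To make both points concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not how any of the tools mentioned in this thread do it. The function names and the 2x2 table in the example are hypothetical: rows are "gene in cluster / not in cluster" and columns are "gene has the GO term / does not". The first function gives the expected counts under independence, which is what the "< 5" rule of thumb looks at. The second gives a one-sided Fisher's exact p-value for over-representation from the hypergeometric distribution. The third simply counts the cells of the K-way table from point 2.

```python
from math import comb


def expected_counts(table):
    """Expected cell counts under independence for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    return [[r * k / n for k in col_totals] for r in row_totals]


def fisher_exact_greater(table):
    """One-sided Fisher's exact p-value, P[X >= a], for over-representation.

    X is the number of cluster genes carrying the GO term; with the margins
    held fixed it follows a hypergeometric distribution.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    in_cluster = a + b   # row 1 total
    with_term = a + c    # column 1 total
    upper = min(in_cluster, with_term)
    total = comb(n, in_cluster)
    tail = sum(
        comb(with_term, x) * comb(n - with_term, in_cluster - x)
        for x in range(a, upper + 1)
    )
    return tail / total


def cells_in_k_way_table(k):
    """Number of cells when genes are cross-classified by presence/absence of k terms."""
    return 2 ** k


if __name__ == "__main__":
    # Hypothetical counts: 50 genes in the cluster, 5 of them with the term;
    # 950 genes outside the cluster, 20 of them with the term.
    table = [[5, 45], [20, 930]]
    print(expected_counts(table))       # the top-left expected count is well below 5
    print(fisher_exact_greater(table))
    print(cells_in_k_way_table(20))
```

With counts like these the top-left expected count falls under the usual threshold, which is the situation where one would prefer the exact test to the chi-square approximation.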
> -----Original Message-----
> From: Nicholas Lewin-Koh [mailto:nikko at hailmail.net]
> Sent: 24 February 2004 08:33
> To: bioconductor at stat.math.ethz.ch
> Cc: rdiaz at cnio.es
> Subject: [BioC] testing GO categories with Fisher's exact test.
> Hi all,
> I have a few questions about testing for over-representation of terms in
> a cluster.
> Let's consider a simple case: a set of chips from an experiment, say
> treated and untreated, with 10,000
> genes on the chip and 1000 differentially expressed. Of the 10000, 7000
> can be annotated and 6000 have
> a GO function assigned to them at a suitable level. Say for this example
> there are 30 GO classes that appear.
> I then conduct Fisher's exact test 30 times, once for each GO category, to detect
> differential representation of terms in the expressed
> set and correct for multiple testing.
> My question is on the validity of this procedure. Just from experience
> many genes will
> have multiple functions assigned to them so the genes falling into GO
> classes are not independent.
> Also, there is the large set of unannotated genes so we are in effect
> ignoring the influence of
> all the unannotated genes on the outcome. Do people have any thoughts or
> opinions on these approaches? It is
> appearing all over the place in bioinformatics tools like FATIGO, EASE,
> DAVID etc. I find that
> the formal testing approach makes me very uncomfortable, especially as
> the biologists I work with tend to overinterpret the results.
> I am very interested to see the discussion on this topic.
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
PGP KeyID: 0xE89B3462