[BioC] testing GO categories with Fisher's exact test.

Wed Feb 25 11:58:35 MET 2004

Hi,
I am crafting a longer reply, which I will send later. In relation to the
chi-squared test: yes, you could do a chi-squared test, but as Ramon points
out, the reason to use Fisher's exact test is exactly that it performs
better for small samples. Though I have always found it very unsatisfying
that Fisher's tea lady needed 4 out of 4 correct for the result to be
significant :) .
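
To make the tea lady remark concrete, here is a minimal sketch of the
one-sided exact p-value. It assumes the classical design of 8 cups, 4 with
milk poured first, where the taster must pick exactly 4 cups as milk-first.
The function name and its parameters are made up for illustration.

```python
from math import comb

def tea_p_value(correct, cups_per_kind=4):
    """One-sided Fisher exact p-value for the tea-tasting design.

    Assumes 2 * cups_per_kind cups, half of them milk-first, and a taster
    who must label exactly cups_per_kind cups as milk-first.
    """
    # Number of equally likely ways to choose which cups to call milk-first.
    total = comb(2 * cups_per_kind, cups_per_kind)
    # Hypergeometric upper tail: `correct` or more true milk-first cups chosen.
    tail = sum(
        comb(cups_per_kind, k) * comb(cups_per_kind, cups_per_kind - k)
        for k in range(correct, cups_per_kind + 1)
    )
    return tail / total

p_all_four = tea_p_value(4)  # 1/70, about 0.014
p_three = tea_p_value(3)     # 17/70, about 0.243
```

Under these assumptions only a perfect score falls below 0.05; with 3 of 4
correct the p-value is already about 0.24.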

In regard to point 2: in theory, if you fit the complete model, you would
have 2^20 cells for a K-way table with K = 20. However, in this problem you
could collapse a lot of the dimensionality, as many of the cells would
contain 0s, and with zero information inference is hard.
Also, one could make some fairly loose assumptions about the correlation
structure and probably reduce the dimensionality further.
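
A rough sketch of how empty such a table is, using Ramon's figures of 10000
genes and K = 20. The independent annotation with probability 0.1 per term
is purely an assumption for illustration (real GO annotations are
correlated); the only point is that the number of occupied cells can never
exceed the number of genes.

```python
import random

def occupied_cells(n_genes=10_000, n_terms=20, p_annot=0.1, seed=0):
    """Count the non-empty cells of a 2^n_terms presence/absence table.

    p_annot is a hypothetical per-term annotation probability, not an
    estimate from any real GO data.
    """
    rng = random.Random(seed)
    patterns = set()
    for _ in range(n_genes):
        # A gene's presence/absence pattern over all terms identifies its cell.
        pattern = tuple(rng.random() < p_annot for _ in range(n_terms))
        patterns.add(pattern)
    return len(patterns), 2 ** n_terms

occupied, total = occupied_cells()
fraction_empty = 1 - occupied / total
```

Whatever the annotation pattern, at most 10000 of the 1048576 cells can be
non-empty, so more than 99% of the table is zeros.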

My 2c

Nicholas
On Wed, 25 Feb 2004 11:11:28 +0100, "Ramon Diaz-Uriarte" <rdiaz at cnio.es>
said:
> Dear Michael,
> 
> On Wednesday 25 February 2004 10:07, michael watson (IAH-C) wrote:
> > Forgive my naivety, but could one not use a chi-squared test here?
> > We have an observed amount of genes in each category, and could calculate
> > an expected from the size of the cluster and the distribution of all genes
> > throughout GO categories...
> 
> I am not sure I see what your question is, so I'll answer two different
> questions and hope one of them does it:
> 
> 1. Chi-square vs. Fisher's exact test: Yes, sure. That is, in fact, what
> some tools do. Others directly use Fisher's exact test for contingency
> tables because/if/when some of the usual assumptions of the chi-square
> test do not hold (e.g., many of the cells have expected counts < 5; actual
> recommendations vary; Agresti [Categorical Data Analysis], Conover
> [Practical Nonparametric Statistics], etc., provide more details on the
> "rules of thumb" for when not to trust the chi-square approximation).
> Fisher's exact test is the classical small-sample test for independence
> in contingency tables.
> 
> 2. Can you do a single test with all GO terms at the same time? I guess
> you could, but as I said in the other email, the problem is the number of
> cells if the number of GO terms is even modest: you have a bunch of genes
> (e.g., 10000), and each is cross-classified according to presence/absence
> of each of the K GO terms. (So "We have an observed amount of genes in
> each category" is not actually an accurate description of the sampling
> scheme here.) That gives you a K-way contingency table; for any modest K,
> the number of cells blows up (e.g., with K = 20, you have 2^20 cells).
> And then you need to be careful with the loglinear model in terms of which
> hypothesis you want to test and which ones you are actually testing.
> 
> Best,
> 
> R.
> 
> 
> 
> 
> >
> > ?
> >
> > -----Original Message-----
> > From: Nicholas Lewin-Koh [mailto:nikko at hailmail.net]
> > Sent: 24 February 2004 08:33
> > To: bioconductor at stat.math.ethz.ch
> > Cc: rdiaz at cnio.es
> > Subject: [BioC] testing GO categories with Fisher's exact test.
> >
> >
> > Hi all,
> > I have a few questions about testing for over-representation of terms in
> > a cluster. Let's consider a simple case: a set of chips from an
> > experiment, say treated and untreated, with 10,000 genes on the chip and
> > 1000 differentially expressed. Of the 10,000, 7000 can be annotated and
> > 6000 have a GO function assigned to them at a suitable level. Say for
> > this example there are 30 GO classes that appear. I then conduct Fisher's
> > exact test 30 times, once for each GO category, to detect differential
> > representation of terms in the differentially expressed set, and correct
> > for multiple testing.
> >
> > My question is on the validity of this procedure. Just from experience,
> > many genes will have multiple functions assigned to them, so the genes
> > falling into GO classes are not independent. Also, there is the large set
> > of unannotated genes, so we are in effect ignoring the influence of all
> > the unannotated genes on the outcome. Do people have any thoughts or
> > opinions on these approaches? They are appearing all over the place in
> > bioinformatics tools like FATIGO, EASE, DAVID, etc. I find that the
> > formal testing approach makes me very uncomfortable, especially as the
> > biologists I work with tend to over-interpret the results.
> > I am very interested to see the discussion on this topic.
> >
> > Nicholas
> >
> 
> -- 
> Ramón Díaz-Uriarte
> Bioinformatics Unit
> Centro Nacional de Investigaciones Oncológicas (CNIO)
> (Spanish National Cancer Center)
> Melchor Fernández Almagro, 3
> 28029 Madrid (Spain)
> Fax: +-34-91-224-6972
> Phone: +-34-91-224-6900
> 
> http://bioinfo.cnio.es/~rdiaz
> PGP KeyID: 0xE89B3462
> (http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)
> 
> 
>