[BioC] GOstats, geneCounts and gene universe filtering...

Thu May 10 16:53:30 CEST 2007

Hi Jesper,

Jesper Ryge <Jesper.Ryge at ki.se> writes:
> I'm trying to perform an enrichment analysis for GO terms on my
> microarray results. My problem arose when I noticed that
> geneCounts(x) doesn't match the number of genes annotated at certain
> nodes according to geneIdsByCategory(x). Maybe that's OK; I just
> wondered whether I missed something. I thought geneCounts was the
> number of interesting genes (from the list fed to geneIds) that
> belong to a particular GO term, and that geneIdsByCategory should
> list those genes, i.e. the numbers should match. This turned out
> not to be the case for at least two of the GO nodes in the list of
> significantly over-represented GO terms:
>
>  > length(geneIdsByCategory(test)[["GO:0051179"]])
> [1] 89
>  > geneCounts(test)["GO:0051179"]
> GO:0051179
>          20
>  > length(geneIdsByCategory(test)[["GO:0007409"]])
> [1] 13
>  > geneCounts(test)["GO:0007409"]
> GO:0007409
>           6
> test is the output from hyperGTest(params), a conditional test for
> over-representation on the rat2302 chip.
>
> As I said, I might have missed something, but it puzzles me somewhat.
> Comments welcome :-)

This doesn't look right to me either.  Can you please send your
sessionInfo() so I'm certain what versions of things you are using?  I
suspect there is a bug in how these functions handle the conditional
case.

> As a "bonus" question, I was wondering if there is any consensus
> regarding filtering the gene universe before doing the GO enrichment
> analysis? I know it's recommended in the GOstats manual, for instance
> by removing probe sets with little variation across samples using IQR
> (or some similar measure). But the topGO package by Adrian Alexa
> seems to care little about this issue and uses all GO-annotated
> probe sets from the chip used in the particular study.

Perhaps that answers your question: there is not widespread consensus.

> I was wondering, if you reduce the set of genes in the gene universe
> (n.GU), don't you also reduce the number of genes annotated (n.GA) at
> each GO term and most likely the number of interesting genes (n.GI)

I think of the filtering process as part of the definition of
"interesting gene".  So a gene that doesn't pass the non-specific
filtering is by definition not interesting and doesn't make it into
the selected gene list.

Yes, non-specific filtering will reduce the set of genes annotated at
some GO terms, but this is desired IMO.
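
A minimal sketch of what I mean, using simulated data and base R only.
The matrix, the probe names, the median-IQR cutoff, and the list of
"differentially expressed" probes are all invented for illustration;
they are not taken from your data or from any package default.

```r
# Simulated expression matrix: 1000 probe sets x 12 samples (hypothetical).
set.seed(1)
exprs_mat <- matrix(rnorm(1000 * 12), nrow = 1000,
                    dimnames = list(paste0("probe", 1:1000), NULL))

# Non-specific filter: keep probe sets whose IQR across samples is above
# the median IQR. The 0.5 cutoff is an arbitrary choice for this sketch.
iqr <- apply(exprs_mat, 1, IQR)
universe <- names(iqr)[iqr > quantile(iqr, 0.5)]

# Pretend these probe sets were called differentially expressed on the
# full, unfiltered data set.
de_full <- paste0("probe", 1:50)

# Only selected probe sets that survive the filter count as "interesting".
selected <- intersect(de_full, universe)

length(universe)   # size of the gene universe after filtering
length(selected)   # selected probe sets remaining in that universe
```

Here the universe and the selected list shrink together, which is the
effect you describe.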

> - at least in my case some of the genes that are filtered out by IQR
> were classified as significantly differentially expressed by cyberT
> or limma on the full data set. So what I'm asking here is: don't
> n.GI and n.GA depend on and change as a function of n.GU? At least
> when you use coarse-grained filtering methods it seems that this is
> the case, and you might lose some interesting genes and in effect
> throw out the baby with the bath water, so to speak?
>
> Put (yet) another way: the chance at GO node X of getting n.GI[X]
> interesting genes out of all the annotated genes n.GA[X] at that
> node by sampling n.GI genes from n.GU at random tells you something
> about the chance of enrichment at node X. I hope I got that part
> right? But if n.GI and n.GA depend on n.GU, this chance of
> enrichment might not change drastically when you reduce the gene
> universe with some coarse-grained variance method? Or?

I think you are on the right track.  Filtering should change the
results, otherwise, why would you filter?  The question at hand is
whether it is appropriate to include all genes annotated at a given GO
term when testing that term.  There is consensus (I hope) that genes
that were not tested in the experiment should be removed.
Non-specific filtering gives you a chance to remove additional genes
that were tested, but appear to provide no information about the
samples.  My experience is that you get more conservative results by
reducing the gene universe as much as possible.  If you play with
phyper a bit, I suspect you will come to a similar conclusion.
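
Something along these lines would be one way to experiment. The
counts are made up, and the helper name enrich_p is just for this
sketch. Note that it holds the category size, the selected list, and
the overlap fixed while shrinking the universe, which real filtering
will not do exactly; vary those too to see how they interact.

```r
# Over-representation p-value: P(X >= n_overlap) under the hypergeometric.
#   n_universe: genes in the gene universe
#   n_category: universe genes annotated at the GO term
#   n_selected: interesting (selected) genes
#   n_overlap:  interesting genes annotated at the term
enrich_p <- function(n_overlap, n_category, n_selected, n_universe) {
  phyper(n_overlap - 1, n_category, n_universe - n_category,
         n_selected, lower.tail = FALSE)
}

p_full     <- enrich_p(6, 100, 200, 10000)  # e.g. all probe sets on a chip
p_filtered <- enrich_p(6, 100, 200, 5000)   # universe halved by filtering
c(full = p_full, filtered = p_filtered)
```

With everything else held fixed, the smaller universe gives the larger
(more conservative) p-value.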

+ seth

-- 
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org