[BioC] Re: GoHyperG

Nicholas Lewin-Koh nikko at hailmail.net
Thu Dec 23 18:11:10 CET 2004

Hi Sean,
My answer is below,
On Thu, 23 Dec 2004 11:29:45 -0500, "Sean Davis" <sdavis2 at mail.nih.gov>
> On Dec 23, 2004, at 10:52 AM, Nicholas Lewin-Koh wrote:
> > Hi Sean,
> > In this situation I would hope it is a one-sided test. I had this
> > same discussion with a colleague who wanted the same thing. I don't
> > think testing for under-representation means anything. Think about
> > the context: one is doing recursive sampling of a finite population
> > for which there are two sources of bias, what is represented in the
> > database or on the chip, and what is annotated on the chip. Further,
> > you are testing at each node the discrepancy from random, and as you
> > go down the DAG a count of zero becomes more and more probable. You
> > can think of it as doing a mark-recapture study on your genes. This
> > problem is exacerbated by the sampling bias. Finally, a last
> > complication is that the test is further biased by your ability to
> > detect differentially expressed genes. At least if you detect
> > over-representation you can argue for a strong signal.
> I'm being a bit dense, but suppose I have 10000 genes on a chip 
> (annotated in ontology Y), 1000 of which are annotated as category X; I 
> find 1000 differentially-expressed genes (annotated in ontology Y) from 
> that chip, but only 12 are from category X.  Is that not interesting to 
> know about?

I'd say probably not. More likely the 12 genes represent a category
somewhere further down the DAG, or they are genes that overlap categories
and are part of another set of groups that is expressed.
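
To put numbers on the example above, here is a minimal sketch in Python
of the two one-sided hypergeometric tail probabilities. It assumes the
simple urn model (1000 genes drawn without replacement from 10000, of
which 1000 are in category X). The helper names `lower_tail` and
`upper_tail` are made up for illustration and are not how GoHyperG
itself computes its p-values.

```python
from math import comb

def lower_tail(x, N, K, n):
    """P(X <= x): x or fewer category genes among the n selected."""
    top = min(x, K, n)
    favourable = sum(comb(K, k) * comb(N - K, n - k) for k in range(0, top + 1))
    return favourable / comb(N, n)

def upper_tail(x, N, K, n):
    """P(X >= x): x or more category genes among the n selected."""
    top = min(K, n)
    favourable = sum(comb(K, k) * comb(N - K, n - k)
                     for k in range(max(x, 0), top + 1))
    return favourable / comb(N, n)

# Numbers from the example in this thread.
N = 10000   # annotated genes on the chip
K = 1000    # genes annotated to category X
n = 1000    # differentially expressed genes
x = 12      # differentially expressed genes that are in category X

expected = n * K / N                 # count expected by chance alone
p_under = lower_tail(x, N, K, n)     # one-sided test for under-representation
p_over = upper_tail(x, N, K, n)      # one-sided test for over-representation

print(f"expected by chance: {expected:.0f}")
print(f"P(X <= {x}) = {p_under:.3g}")
print(f"P(X >= {x}) = {p_over:.3g}")
```

Under this model about 100 genes from category X are expected, so 12 is
far out in the lower tail and the over-representation p-value is
essentially 1. The arithmetic says nothing about whether the result has
a biological interpretation, which is the real question here.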

If you do detect under-representation, how would you interpret it? I
don't see how there would be a biological interpretation (mind you, I am
not a biologist) unless you had a distinct hypothesis about a group that
should be expressed under the treatment, in which case this is probably
the wrong approach and something like Jelle Goeman's global test would
be much more appropriate.

> As for finding zeros, as it becomes more probable as one moves down the 
> DAG, of course finding "underrepresented" groups becomes prohibitively 
> difficult, but for large categories is certainly possible.  As for 
> biases, I'm not sure that I agree that ability to detect 
> differentially-expressed genes is a source of "bias".  It is certainly 
> a limitation, but I don't think a bias.  And I'm not sure what 
> "sampling bias" might be present?

Look at the parameters in the hypergeometric. The idea behind the test
is sampling from a finite population. We have a finite population N, but
it is conditional on the probes being annotated and represented on the
chip. So from that perspective we are conditionally unbiased. But at
each level of refinement in GO we can expect that annotation will be
more variable, so we are "losing" genes as the functions become more
refined. It is like dropping marbles through leaky pipes and trying to
estimate the total by what drops through at the bottom.
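
The earlier point about zeros becoming more probable as categories get
more refined can be checked numerically. This is a rough sketch in
Python, assuming the chip and list sizes from the example in this thread
(10000 annotated genes, 1000 selected) and some invented category sizes;
`prob_zero` is a hypothetical helper, not part of any Bioconductor
package. Under the null, P(X = 0) is the smallest lower-tail p-value a
category can ever reach, so it acts as a floor on what an
under-representation test can report.

```python
from math import comb

def prob_zero(N, K, n):
    """P(X = 0): none of the n selected genes falls in a category of size K."""
    return comb(N - K, n) / comb(N, n)

N, n = 10000, 1000                       # chip size and list size from this thread
category_sizes = [1000, 100, 30, 10, 3]  # invented sizes, shrinking down the DAG

# Smallest attainable under-representation p-value for each category size.
floors = {K: prob_zero(N, K, n) for K in category_sizes}

for K, p0 in floors.items():
    print(f"category size {K:5d}: P(X = 0) = {p0:.3g}")
```

For a category of 10 genes, seeing none of them in the list happens
about a third of the time by chance alone, so no amount of
under-representation in such a category can look significant.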

Anyway I'm drinking eggnog as I write so I may not be making as much
sense as I think I am.

A merry Christmas to you.


> Thanks for the food for thought.
> Sean
> >>
> >> Message: 4
> >> Date: Wed, 22 Dec 2004 11:02:55 -0500
> >> From: Sean Davis <sdavis2 at mail.nih.gov>
> >> Subject: [BioC] GoHyperG
> >> To: Bioconductor <bioconductor at stat.math.ethz.ch>
> >> Message-ID: <F0EE8E4B-5432-11D9-ACCB-000D933565E8 at mail.nih.gov>
> >> Content-Type: text/plain; charset=US-ASCII; format=flowed
> >>
> >> Just a quick question--are the p-values from gohyperg one- or
> >> two-sided?  I have a collaborator who would like to use it to 
> >> determine
> >> underrepresented ontology categories.
> >>
> >> Thanks,
> >> Sean
> >>
> >>
> >>
