[BioC] Gene enrichment question

Wed Aug 15 17:02:16 CEST 2012

On 15.08.2012 14:51, Aliaksei Holik wrote:
> Dear listers,
>
> Apologies if my question is not strictly related to Bioconductor,
> though one never knows, maybe there's a package that does what I need.
>
> I am analysing a list of differentially expressed genes from an
> Illumina microarray. In particular I'm trying to compare the list of
> differentially expressed genes to an existing list of genes
> preferentially expressed in the stem cell population (stem cell
> signature). When I do so, 10% of DE genes belong to the stem cell
> signature. What I'd like to do now is to find out, how likely that
> would happen by chance, i.e. put a p value on it.
>
> At the moment I know:
> There're 17119 unique genes in my dataset.
> Of them 86 are differentially expressed.
>
> The stem cell signature contains 510 genes.
> It is combined from several platforms, which makes it hard to
> establish the total number of unique genes, but it's at least 20819
> (the size of the largest platform).
>
> There are 9 overlapping genes between DE genes and the stem cell
> signature.
>
> So I wonder:
>
> 1) If there's an accepted way to calculate a p value using these
> data. For instance, could I run something like a chi squared test?
> E.g. stem cell specific genes represent 510/20819=2.4% of total
> dataset. So expected number of the stem cell genes in my DE genes
> would be 86x2.4%=2. So my chi squared test would be based on 9
> observed vs 2 expected.

Hypergeometric test?

> phyper(9 - 1, 86, 17119 - 86, 510, lower.tail = FALSE)
[1] 0.001035456
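
To spell out how the arguments map onto your numbers, here is a small sketch
(the variable names are my own, not from any package). phyper(q, m, n, k)
treats the 86 DE genes as the "white balls", the remaining genes as the
"black balls", and the 510 signature genes as the draw; passing 9 - 1 with
lower.tail = FALSE gives P(overlap >= 9). A one-sided Fisher's exact test on
the corresponding 2x2 table should give the same P value. This assumes all
510 signature genes are present in the 17,119-gene dataset.

```r
# Counts taken from the original question
n_total   <- 17119  # unique genes in the dataset (the "universe")
n_de      <- 86     # differentially expressed genes
n_sig     <- 510    # stem cell signature genes (assumed to all be in the dataset)
n_overlap <- 9      # genes that are both DE and in the signature

# Upper tail of the hypergeometric distribution: P(X >= n_overlap)
p_hyper <- phyper(n_overlap - 1, n_de, n_total - n_de, n_sig,
                  lower.tail = FALSE)

# The same test expressed as a one-sided Fisher's exact test on a 2x2 table
tab <- matrix(c(n_overlap,                             # DE, in signature
                n_de - n_overlap,                      # DE, not in signature
                n_sig - n_overlap,                     # not DE, in signature
                n_total - n_de - n_sig + n_overlap),   # neither
              nrow = 2,
              dimnames = list(signature = c("yes", "no"),
                              de        = c("yes", "no")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Overlap expected by chance under the null, for comparison with the observed 9
expected_overlap <- n_sig * n_de / n_total

print(c(hypergeometric = p_hyper, fisher = p_fisher))
print(expected_overlap)
```

The test is symmetric in the two gene lists, so swapping the roles of the DE
genes and the signature genes in the phyper call should not change the result.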

For the total number of genes I used the smaller of your two figures
(17,119) to be conservative. To be completely correct, I think you would
need to remove any of the 510 signature genes that are not in your
17,119-gene dataset, as they cannot be called DE if they are not in your
dataset. That would only make the P value smaller (more significant),
though, and it is already 'significant' by most journals' standards.
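
To illustrate the effect of that correction, here is a rough sketch that
recomputes the P value for a few signature sizes. The values 450 and 400 are
purely hypothetical stand-ins for "signature genes actually present in the
dataset"; the real count would have to come from intersecting the two gene
lists. The overlap of 9 is held fixed, since genes absent from the dataset
cannot contribute to it.

```r
# Counts taken from the original question
n_total   <- 17119  # unique genes in the dataset
n_de      <- 86     # differentially expressed genes
n_overlap <- 9      # DE genes that are in the signature

# Signature sizes to try: 510 is the full signature; 450 and 400 are
# HYPOTHETICAL counts of signature genes present in the dataset
n_sig_on_array <- c(510, 450, 400)

# phyper is vectorised over its arguments, so this returns one P value per size
p_values <- phyper(n_overlap - 1, n_de, n_total - n_de, n_sig_on_array,
                   lower.tail = FALSE)
names(p_values) <- n_sig_on_array

print(p_values)
```

The P value should shrink as the signature gets smaller, because the same 9
overlapping genes are then being found in a smaller draw.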

-- 
Alex Gutteridge