[BioC] Gene enrichment question

Wu, Di dwu at fas.harvard.edu
Thu Aug 16 03:28:01 CEST 2012

Hi Aliaksei,

Before I make any suggestions, I have two questions that will help decide which gene set testing method suits your case.

First, based on the biology, which would you treat as the gene set if you had to choose one: the stem cell signature genes or the differentially expressed genes?

Second, do you have the expression data for the two datasets? Your latest email suggests that you may have it.

As you may know, a gene set test in your case needs a gene set from one dataset and expression data from another study.

If we take the stem cell signature genes as the gene set, we will need the expression data from your study. With those, our gene set testing methods "ROAST", "CAMERA" and "ROMER" in limma should work well. Which one to use depends on your statistical hypothesis: ROAST is a self-contained test (are the genes in the set differentially expressed at all?), while CAMERA and ROMER are competitive tests (is the set more differentially expressed than the other genes?). I would suggest starting with ROAST. The R code for ROAST should be straightforward, and I will be happy to help if you have questions about using it.

"ROAST: rotation gene set tests for complex microarray experiments"
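To give a feel for the calls, here is a minimal sketch of running ROAST, with CAMERA alongside for comparison. Everything about the data is an assumption made for illustration: the expression matrix is simulated, the gene names, the 3-versus-3 design, and the 50-gene "signature" are hypothetical, and in practice you would substitute your own normalised log-expression matrix and design.

```r
library(limma)

set.seed(1)

# Hypothetical simulated data: 1000 genes on 6 arrays (3 control, 3 treated).
y <- matrix(rnorm(1000 * 6), nrow = 1000, ncol = 6)
rownames(y) <- paste0("gene", 1:1000)

group  <- factor(rep(c("ctrl", "trt"), each = 3))
design <- model.matrix(~ group)   # column 2 is the treated-vs-control effect

# Hypothetical signature: a vector of gene identifiers from the other study.
stem_signature <- paste0("gene", 1:50)
idx <- which(rownames(y) %in% stem_signature)

# Shift half of the signature genes upwards in the treated arrays,
# so that the simulated set has something to detect.
y[idx[1:25], 4:6] <- y[idx[1:25], 4:6] + 1

# Self-contained rotation test for the set.
res_roast <- roast(y, index = idx, design = design, contrast = 2, nrot = 999)

# Competitive test that allows for inter-gene correlation.
res_camera <- camera(y, index = idx, design = design, contrast = 2)

print(res_roast)
print(res_camera)
```

The only piece that comes from the other study is the index vector; the statistics are all computed from your own expression data.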

On the other hand, perhaps you don't have (or don't want to use) any expression data at the moment. In that case, do you have genome-wide t statistics (or log fold changes) from your analysis? If we still take the stem cell signature genes as the gene set, you can use "geneSetTest" in limma (usually giving a rank-based p value) with the genome-wide t statistics or log fold changes from your own study. As we mention in our "CAMERA" paper, this test can be somewhat optimistic, because it treats genes as independent and ignores inter-gene correlation. You can still draw a conclusion if you see a very significant p value.

"Camera: a competitive gene set test accounting for inter-gene correlation"
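A rough sketch of the "geneSetTest" route follows, assuming you have one t statistic per gene. The statistics here are simulated and the gene names are hypothetical; only the set size (510) and the number of genes (17119) are borrowed from the numbers in the question.

```r
library(limma)

set.seed(2)

# Hypothetical genome-wide t statistics, one per gene, named by gene ID.
t_stat <- rnorm(17119)
names(t_stat) <- paste0("gene", seq_along(t_stat))

# Hypothetical signature of 510 genes, all assumed to be in the dataset.
signature <- paste0("gene", 1:510)
in_set <- names(t_stat) %in% signature

# Give the signature genes a modest upward shift for the simulation.
t_stat[in_set] <- t_stat[in_set] + 0.5

# Rank-based test: do set genes tend to rank higher than the rest?
p_up <- geneSetTest(index = in_set, statistics = t_stat, alternative = "up")

# "mixed" asks whether set genes are more extreme in either direction.
p_mixed <- geneSetTest(index = in_set, statistics = t_stat, alternative = "mixed")

print(c(up = p_up, mixed = p_mixed))
```

Because the test assumes independent genes, a borderline p value from it deserves some caution.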

The gene set tests above (like GSEA) require only one gene set; they do not need a cutoff to define a second gene list.

Finally, if you don't even have genome-wide t statistics or log fold changes, the hypergeometric test or Fisher's exact test might be the only options, as others suggested. The R function "phyper" may help.
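As a sketch of the hypergeometric calculation with the counts from the question: this assumes the 17119 genes in your dataset are the universe and, importantly, that all 510 signature genes are present in that universe. The second assumption probably does not hold exactly for a signature combined from several platforms, so the signature should first be restricted to genes actually on your array.

```r
N <- 17119  # genes in the dataset (taken as the universe)
K <- 510    # signature genes, ASSUMED all present in the universe
n <- 86     # differentially expressed genes
k <- 9      # DE genes that are also in the signature

# Overlap expected by chance under random sampling.
expected <- n * K / N

# P(overlap >= k): upper tail of the hypergeometric distribution.
# phyper() gives P(X <= q), so use k - 1 with lower.tail = FALSE.
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same test as a one-sided Fisher's exact test on the 2x2 table.
tab <- matrix(c(k,     n - k,
                K - k, N - K - (n - k)),
              nrow = 2,
              dimnames = list(signature = c("in", "out"),
                              DE        = c("yes", "no")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(expected = expected, p_hyper = p_hyper, p_fisher = p_fisher))
```

The two p values should agree, since the one-sided Fisher's exact test is the hypergeometric upper tail.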


Di Wu
Postdoctoral fellow
Harvard University, Statistics Department
Harvard Medical School
Science Center, 1 Oxford Street, Cambridge, MA 02138-2901 USA

From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] On Behalf Of Aliaksei Holik [salvador at bio.bsu.by]
Sent: Wednesday, August 15, 2012 9:51 AM
Cc: bioconductor at r-project.org
Subject: [BioC] Gene enrichment question

Dear listers,

Apologies if my question is not strictly related to Bioconductor, though
one never knows, maybe there's a package that does what I need.

I am analysing a list of differentially expressed genes from an Illumina
microarray. In particular I'm trying to compare the list of
differentially expressed genes to an existing list of genes
preferentially expressed in the stem cell population (stem cell
signature). When I do so, 10% of DE genes belong to the stem cell
signature. What I'd like to do now is to find out, how likely that would
happen by chance, i.e. put a p value on it.

At the moment I know:
There're 17119 unique genes in my dataset.
Of them 86 are differentially expressed.

The stem cell signature contains 510 genes.
It is combined from several platforms, which makes it hard to establish
the total number of unique genes, but it's at least 20819 (the size of
the largest platform).

There are 9 overlapping genes between DE genes and the stem cell signature.

So I wonder:

1) If there's an accepted way to calculate a p value using these data.
For instance, could I run something like a chi-squared test? E.g. stem cell
specific genes represent 510/20819=2.4% of total dataset. So expected
number of the stem cell genes in my DE genes would be 86x2.4%=2. So my
chi squared test would be based on 9 observed vs 2 expected.

2) Or do I have to generate a geneset based on the stem cell signature
and go through GSEA algorithms to calculate enrichment and significance.

Any pointers in the right direction would be much appreciated.

Many thanks for your time and help!

