[BioC] Simple pathway enrichment analysis for gene lists

Paul Shannon paul.thurmond.shannon at gmail.com
Wed Sep 11 18:05:16 CEST 2013


Hi Enrico, 

You ask good questions.

In answer to your immediate need, pasted in below are two versions of a short KEGG pathway enrichment script.  
The first emphasizes simplicity and clarity, and uses only the stock R "dhyper" function along with KEGGREST.
The second version (courtesy of Martin, who is also responsible for some of the approaches I use in my simpler version)
produces a richer data.frame result.

Inasmuch as these examples use KEGGREST, you benefit from the current curated data at KEGG.   As you point out, KEGG.db will only become increasingly out of date.  KEGGREST gets gene/pathway memberships dynamically, avoiding the KEGG prohibition on full downloads.

We offer these two KEGGREST:dhypher/hyperg examples informally, and in hopes of encouraging discussion about what Bioconductor should offer more formally.

It would be illuminating to run the same demo using pathways from Reactome.  Perhaps you could contribute such a demo?

Please let us know what you think.

 - Paul

----- simpler version
library(KEGGREST)
library(org.Hs.eg.db)

     # created named list, eg:  path:map00010: "Glycolysis / Gluconeogenesis" 

pathways.list <- head(keggList("pathway", "hsa"), n=10)

     # make them into KEGG-style human pathway identifiers
pathway.codes <- sub("path:", "", names(pathways.list))

   # for demonstration, just use the first ten pathways
   # not all pathways exist for human, so TODO: tryCatch the
   # keggGet to be robust against those failures

   # subsetting by c(TRUE, FALSE) -- which repeats
   # as many times as needed, sorts through some
   # unexpected packaging of geneIDs in the GENE element
   # of each pw[[n]]

genes.by.pathway <- sapply(pathway.codes,
                           function(pwid){
                               pw <- keggGet(pwid)
                               pw[[1]]$GENE[c(TRUE, FALSE)]
                               })

all.geneIDs <- keys(org.Hs.eg.db)

   # chose one of these for demonstration.  the first (a whole genome random
   # set of 100 genes)  has very little enrichment, the second, a random set
   # from the pathways themsevles,  has very good enrichment

genes.of.interest <- c("23118", "23119", "23304", "25998", "26001", "51043",
                       "55632", "55643", "55743", "55870", "7314",  "56254",
                       "7316",  "144193","784",   "8837",  "1111",  "84706",
                       "200931","169522","5707",  "5091",  "5901",  "55532",
                       "9777")

   # the hypergeometric distribution is traditionally explained in terms of
   # drawing a sample of balls from an urn containing black and white balls.
   # to keep the arguments straight (in my mind at least), I use these terms
   # here also

pVals.by.pathway <- sapply(names(genes.by.pathway),
                        function(pathway) {
                           pathway.genes <- genes.by.pathway[[pathway]]
                           white.balls.drawn <- length(intersect(genes.of.interest, pathway.genes))
                           white.balls.in.urn <- length(pathway.genes)
                           total.balls.in.urn <- length(all.geneIDs)
                           black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn
                           total.balls.drawn.from.urn <- length(genes.of.interest)
                           dhyper(white.balls.drawn,
                                  white.balls.in.urn,
                                  black.balls.in.urn,
                                  total.balls.drawn.from.urn)
                           })

print(pVals.by.pathway)

---- richer version

library(KEGGREST)
library(org.Hs.eg.db)

     # created named list, length 449, eg:
     # path:hsa00010: "Glycolysis / Gluconeogenesis"

pathways <- keggList("pathway", "hsa")

     # make them into KEGG-style human pathway identifiers
human.pathways <- sub("path:", "", names(pathways))

   # for demonstration, just use the first ten pathways

demo.pathway.ids <- head(human.pathways, 10)
demo.pathways <- setNames(keggGet(demo.pathway.ids), demo.pathway.ids)

genes.by.pathway <- lapply(demo.pathways, function(demo.pathway) {
     demo.pathway$GENE[c(TRUE, FALSE)]
      })

all.geneIDs <- keys(org.Hs.eg.db)

   # chose one of these for demonstration.  the first (a whole genome random
   # set of 100 genes)  has very little enrichment, the second, a random set
   # from the pathways themselves,  has very good enrichment in some pathways

set.seed(123)
significant.genes <- sample(all.geneIDs, size=100)
#significant.genes <- sample(unique(unlist(genes.by.pathway)), size=10)

   # the hypergeometric distribution is traditionally explained in terms of
   # drawing a sample of balls from an urn containing black and white balls.
   # to keep the arguments straight (in my mind at least), I use these terms
   # here also

hyperg <- Category:::.doHyperGInternal
hyperg.test <-
    function(pathway.genes, significant.genes, all.genes, over=TRUE)
{
    white.balls.drawn <- length(intersect(significant.genes, pathway.genes))
    white.balls.in.urn <- length(pathway.genes)
    total.balls.in.urn <- length(all.genes)
    black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn
    balls.pulled.from.urn <- length(significant.genes)
    hyperg(white.balls.in.urn, black.balls.in.urn,
           balls.pulled.from.urn, white.balls.drawn, over)
}

pVals.by.pathway <-
    t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
    
print(pVals.by.pathway)


On Sep 10, 2013, at 9:23 AM, Enrico Ferrero wrote:

> Hi Paul,
> 
> Thanks for your suggestions.
> I already performed a GO enrichment analysis and I was specifically
> looking at the enrichment of pathways. You also suggest GSEA, but,
> correct me if I'm wrong, I think I would need some additional value to
> rank my list of gene IDs, as the GSEA algorithm requires a ranked list
> to start with.
> I guess at the moment my best option is to use Category and the
> outdated KEGG.db to have an idea of the enriched pathways and then use
> some non-Bioconductor tool (any suggestion?).
> 
> I'm not sure this is the right place, but, on a more general note, may
> I ask if there are any plans to provide better and more integrated
> support for pathway enrichment analysis in Bioconductor?
> For example, would it be possible to build something similar to an up
> to date KEGG.db using KEGG's REST API? After all, packages such as
> gage and ROntoTools get their information from KEGG in that way.
> Alternatively, while not as comprehensive as KEGG, Reactome is
> completely open source and open access and I would be surprised if
> there wasn't a more fruitful way to integrate it with the core
> Bioconductor tools an packages.
> 
> I'm probably being naive here, but I'm afraid I don't fully understand
> what is the general consensus of the Bioconductor community on current
> status and future directions of pathway analysis tools.
> 
> Thank you.
> Best,
> 
> On 9 September 2013 16:25, Paul Shannon <paul.thurmond.shannon at gmail.com> wrote:
>> Hi Enrico,
>> 
>> reactome.db is best described, I believe, as simply a bioc rendering of Reactome's sql database -- Marc, please correct me if I am wrong.
>> 
>> There is thus a data representation obstacle when using reactome.db: molecular relations are described, with pathway/gene mappings not so easy to get at.
>> In addition, and despite Reactome's many strengths, its coverage is incomplete. The canonical wnt pathway, for instance, is (at my last check) not included.
>> 
>> If you have a list of geneIDs, exploratory analysis can usefully start out with both GO enrichment, KEGG enrichment, and GSEA. Though the information in KEGG.db has not been updated in a couple of years, the information there is still very useful for exploratory data analysis. Any enrichments you discover using these assorted gene/ctaegory associations may lead you to a close study of particular functions or pathways, and it is this point that you may wish to get the latest and most specific information via KEGGREST and Reactome (and, with our next release) the new PSICQUIC package (see http://code.google.com/p/psicquic/).
>> 
>> I hope this helps. Let us know if it falls short, or if new questions arise.
>> 
>> - Paul
>> 
>> 
>> On Sep 9, 2013, at 7:53 AM, Enrico Ferrero wrote:
>> 
>>> Dear list,
>>> 
>>> Can anybody suggest how to perform a simple pathway enrichment
>>> analysis starting from a list of gene IDs?
>>> 
>>> I know about the gage and ROntoTools packages that use KEGGREST to
>>> retrieve an up to date version of the KEGG database, but, as far as I
>>> understand, they require a microarray experiment as input (or at least
>>> fold changes and pvalues).
>>> 
>>> Since this time around I'm not starting from a microarray experiment
>>> but I just have a gene list, I'm looking for a way to perform pathway
>>> enrichment analysis using a simple numerical method such as Fisher's /
>>> hypergeometric test.
>>> 
>>> I know the Category package still provides a KEGGHyperG class (which
>>> would be perfect!), but the results are based on the outdated version
>>> of KEGG (via KEGG.db, I guess).
>>> 
>>> Are there any good alternatives available out there? Would it be
>>> possible to use reactome.db in conjunction with the Category/GOstats
>>> functions for example?
>>> 
>>> Thank you!
>>> Best,
>>> 
>>> --
>>> Enrico Ferrero
>>> Department of Genetics
>>> Cambridge Systems Biology Centre
>>> University of Cambridge
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
> 
> -- 
> Enrico Ferrero
> Department of Genetics
> Cambridge Systems Biology Centre
> University of Cambridge



More information about the Bioconductor mailing list