[BioC] GeneSetCollection from GSEA

Wed Feb 27 18:15:07 CET 2008

It's funny, Paul asked me this in a private email. I gave him a
complicated solution and suggested he ask on the mailing list so all
would benefit. And of course Sean offers an excellent simple solution,
to which I add a bit below...

"Sean Davis" <sdavis2 at mail.nih.gov> writes:

> On Wed, Feb 27, 2008 at 3:21 AM, Paul Christoph Schröder
> <pschrode at alumni.unav.es> wrote:
>> Hello all,
>>
>>  I'm using the GSEABase package For gene set enrichment I used the java software provided by the Broad Institute, but now since I want to use R I'm trying to implement it to my workflow.
>>
>>  If I'm interested only in a gene set collection like 'c2' from MSIGDB, what would be the appropiate command for getting the gene set collection?
>>
>>  I tried:
>>  getBroadSets('msigdb_v2.1.xml', base="http://www.broad.mit.edu/gsea/donwloads.jsp")
>
> Hi, Paul.
>
> You need to specify the actual uri of the file that you want to get:
>
> 

These are all gene sets in the Broad collection. If you wanted just
those with 'c2' designation then you might do something like

> isC2 <- sapply(broadsets,
+                function(x) bcCategory(collectionType(x))) == "c2"
> c2collection <- broadsets[isC2]
> c2collection
GeneSetCollection
  names: CHESLER_D6MIT150_TRANSCRIPTION_FACTORS_GLOCUS, DEATHPATHWAY, ..., ET743_RESIST_DN (1687 total)
  unique identifiers: MSX3, BACH1, ..., RUSC2 (16966 total)
  types in collection:
    geneIdType: SymbolIdentifier (1 total)
    collectionType: BroadCollection (1 total)

Reassuringly, the number of gene sets identified (1687) is the same as
from the Broad.

For the hard-core, my off-list response to Paul was along the lines of

1. download msigdb_v2.1.xml by hand (because I didn't pay attention to
the URL my copy came from, because the Broad people ask you to
register before using it so it might be nice to let them know,
implicitly, that their data is uesful and to whom, and because I want
to reference it on more than one occaision so might as well avoid the
bandwidth download; mostly though because I forgot where to get it
from directly).

2. define the following function

getLocalBroadSets <- function(file, category) {
    xpathq <- paste('//GENESET[@CATEGORY_CODE="',
                    category,
                    '"]', sep="")
    handler <- GSEABase:::.BroadXMLNodeToGeneSet_factory(NULL)
    GeneSetCollection(GSEABase:::.fromXML(file, xpathq, handler))
}

3. Evaluate the commands

> fl <- '~/tmp/msigdb_v2.xml'
> getLocalBroadSets(fl, "c1")

to which Paul rightly asks for an explanation of the cryptic function:

getLocalBroadSets <- function(file, category) {

    ## 'file' is an XML file that we'll read in (parse) using the XML
       package

    xpathq <- paste('//GENESET[@CATEGORY_CODE="',
                    category,
                    '"]', sep="")
    handler <- GSEABase:::.BroadXMLNodeToGeneSet_factory(NULL)
    GeneSetCollection(GSEABase:::.fromXML(file, xpathq, handler))
}

Here's a more fully commented version. The key thing is that you can
reformulate the query to very efficiently extract complicated subsets
of the msigdb_v2.1 xml file.

getLocalBroadSets <- function(file, category) {
    ## 'file' is a text file formatted as xml. We'll read it in to R
    ## using the XML package. You can look at the file and you'll see
    ## something like
    ## 
    ## <?xml version="1.0" encoding="UTF-8"?>
    ## <MSIGDB NAME="chr16q24" VERSION="1" BUILD_DATE="1">
    ##   <GENESET STANDARD_NAME="chr5q23" ...
    ##
    ## Things like 'MSIGDB', 'GENESET' are called each a
    ## 'node'. things like 'STANDARD_NAME' is an 'attribute'.
    ##
    ## It's possible using the amazing XML package and the underlying
    ## libxml library, to treat the XML as a kind of data base that
    ## you can query against. The syntax of queries is defined by the
    ## XPATH query language, and is really quite simple (see the
    ## abbreviated syntax at
    ## 
    ## http://www.w3.org/TR/xpath#path-abbrev
    ##
    xpathq <- paste('//GENESET[@CATEGORY_CODE="',
                    category,
                    '"]', sep="")

    ## we construct a query that says 'select any GENESET node with
    ## attribute CATEGORY_CODE equal to the value of the variable
    ## 'category'
    ##
    ## This next line is best treated as black magic, but basically
    ## when we find a node from our query, we want to process it into
    ## a GeneSet. In the XML package we do this by providing a
    ## function that's invoked whenever the object is
    ## encountered. Often we want our handler to remember things
    ## (e.g., the location of the file that is being parsed) for each
    ## time it gets invoked. One way of doing this is by using R's
    ## 'lexical scope' to define a function that returns a
    ## function. The returned function (type 'handler' after the
    ## definition below) parses the node into a gene set, but has
    ## access to the variables defined in the function that defined it
    ## (.BroadXMLNodeToGeneSet_factory). As I said, best to treat as
    ## black magic

    handler <- GSEABase:::.BroadXMLNodeToGeneSet_factory(NULL)

    ## We'll now parse the file form XML, using our xpathq to identify
    ## the nodes to extract, and our handler to convert each node from
    ## XML to a GeneSet. We'll wrap the entire object into a
    ## GeneSetCollection, instead of just returning a list

    GeneSetCollection(GSEABase:::.fromXML(file, xpathq, handler))
}

> That should do it for you.
>
> Sean
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793