[BioC] domainsignatures with non-human KEGG pathways

Robert Castelo robert.castelo at upf.edu
Wed Dec 16 10:57:09 CET 2009


thanks for your help with this!!

i have a further technical question regarding the domainsignatures
package. looking at the vignette, and in particular the volcano plot of
the example of its usage with KEGG, i see that similarity values of 0
may get p-values smaller than 0.05 which could be regarded as
significant (as one does for instance with GO enrichement analysis) but
however it is clear that a 0 similarity is biologically not significant.

i've tried to understand how this p-value is calculated by looking at
the paper and on page 5 says that "the p-value is defined as the
fraction of data points in this empirical distribution larger than the
similarity measurement of the original gene list", where the empirical
distribution refers to sampling gene lists uniformly at random and
compute the similarities to form the null distribution.

if i look at the code that calculates this p-value, these are, i think,
the two important lines within the function compSimilarities:

  ## compute pvalues from sample distributions
  pvals <- mapply(function(x,y) sum(x>y)/n, dists, setDists)
  pvals[pvals==0] <- 1/n

so, in fact, where i saw 0 p-values in the plot they're in fact 1/n (n
is the number of simulations) i guess to correctly assign an upper bound
to the p-value since it comes from a finite number of simulations.

in order to sort out the statistical versus biological significance
problem, would it make sense to slightly change the definition of the
p-value to "the faction of data points in this empirical distribution
larger OR EQUAL than the similarity measurement" ?

thus, changing the first of the two previous lines of code to

  pvals <- mapply(function(x,y) sum(x>=y)/n, dists, setDists)


cheers,
robert.

On Wed, 2009-12-16 at 09:55 +0100, florian.hahne at novartis.com wrote:
> 
> Hi Robert, 
> the initial idea in the package was to keep the generation of the
> ipDataSource object separated from the actual gene set testing. Of
> course we don't want to restrict ourselves to human only, but we also
> don't want to be restricted to a particular type of pathway
> collection. Hence the getKEGGData function was only suposed to be a
> little pointer how one could generate these objects, e.g. by tapping
> into Biomart. I do agree that this little change could make the
> function a bit more useful, and will implement your suggestion as soon
> as possible. 
> Regarding the tcltk warning: this just means that you haven't compiled
> R with tcltk support, and you don't get a graphical progress bar.
> Should you ever need tcltk support in the future (it is not mandatory
> for the proper functioning of the domainsignatures package) you should
> install both the tcl and the tk libraries, along with their header
> files (often in separate -dev rpms, should you use a Linux package
> manager), and recompile R. 
> Thanks for the input, 
> Florian
> Florian Hahne, PhD
> Novartis Institutes for Biomedical Research
> TS/PCS/iTox
> CHBS, WKL-136.1.76, CHBS, WKL-136.1.75
> Novartis Pharma AG, Werk Klybeck
> Klybeckstrasse 141
> CH-4057 Basel
> Switzerland
> Phone: +41 61 6967127
> Email : florian.hahne at novartis.com
> 
> 
> 
> 
> 
> 
> Robert Castelo
> <robert.castelo at upf.edu> 
> Sent by:
> bioconductor-bounces at stat.math.ethz.ch 
> 
> 15.12.2009 12:48 
> 
> 
>                To
> bioconductor at stat.math.ethz.ch 
>                cc
> Florian Hahne
> <fhahne at fhcrc.org> 
>           Subject
> [BioC]
> domainsignatures
> with non-human
> KEGG pathways
> 
> 
> 
> 
> 
> 
> 
> 
> dear list and, particularly, dear domainsignatures package maintainers
> (Florian?),
> 
> i was trying to use the package domainsignatures from the current
> BioC-devel version (see my sessionInfo at the end of this message) to
> test for the enrichment of a gene list throughout the collection of
> available KEGG pathways in mouse and found that the main function that
> collects the KEGG data is tailored to be employed with human data
> only.
> more concretely, the function 'getKEGGdata' contains the following
> hardcoded line in its source:
> 
>    ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
> 
> since this function already provides the possibility of restricting
> the
> set of pathways to be tested through the 'pathways' argument i guess
> that it is not the intention of the package to restrict itself to
> human.
> so, i'd like to suggest the maintainers to try to make the function
> general for any organism for which KEGG and ensembl provide the
> necessary data.
> 
> to get inmediately going i've made a quick dirty fix which i paste
> below, just in case it may be useful.
> 
> btw, the package function 'gseDomain' outputs in my R-devel
> installation
> the following warning after being called:
> 
> Warning message:
> In progress(message = mess, sub = sub) : Need tcltk for the status bar
> 
> which i guess has to do with the fact that i'm missing some software
> component in my linux box because loading 'tcltk' gives the following
> messsage:
> 
> library(tcltk)
> Error in firstlib(which.lib.loc, package) : 
>  Tcl/Tk support is not available on this system
> Error in library(tcltk) : .First.lib failed for 'tcltk'
> 
> searching for documentation about how to properly install 'tcltk' i've
> found out that this package seems to be removed from CRAN, see
> http://cran.r-project.org/web/packages/tcltk/index.html
> 
> and i've seen another package called 'tcltk2' which sounds like a
> replacement for 'tcltk'. i just wanted to comment this in case it may
> be
> an issue to consider for the package maintainers.
> 
> thanks!!!
> robert.
> 
> myGetKEGGdata <- function(universe=NULL, pathways=NULL,
> ensemblMart=NULL) { ## add ensemblMart argument
>    op <- options(warn = -1)
>    on.exit(options(op))
>    if (class(try(readLines("http://www.bioconductor.org"), silent =
> TRUE)) == 
>        "try-error") 
>        stop("Active internet connection needed for this function")
>    options(op)
>    if (!is.null(pathways)) 
>        hKEGGids <- pathways
>    else hKEGGids <- grep("^hsa", ls(KEGGPATHID2EXTID), value = TRUE)
>    path2Genes <- mget(hKEGGids, KEGGPATHID2EXTID)
>    hKEGGgenes <- union(universe, unique(unlist(path2Genes, use.names =
> FALSE)))
>    hKEGGgenes <- hKEGGgenes[!is.na(hKEGGgenes)]
>    if (is.null(ensemblMart)) ## if no specific ensembl mart is
> provided
> then use human
>      ensemblMart <- "hsapiens_gene_ensembl"
>    ensembl <- useMart("ensembl", dataset = ensemblMart)
>    tmp <- getBM(attributes = c("entrezgene", "interpro"), filters =
> "entrezgene", 
>        values = hKEGGgenes, mart = ensembl)
>    gene2Domains <- split(tmp$interpro, tmp$entrezgene, drop = FALSE)
>    missing <- setdiff(hKEGGgenes, names(gene2Domains))
>    gene2Domains[missing] <- ""
>    hKEGGdomains <- unique(unlist(gene2Domains))
>    hKEGGdomains <- hKEGGdomains[!is.na(hKEGGdomains)]
>    path2Domains <- lapply(path2Genes, function(x, gene2Domains)
> unique(unlist(gene2Domains[x], 
>        use.names = FALSE)), gene2Domains)
>    dims <- c(pathway = length(hKEGGids), gene = length(hKEGGgenes), 
>        domain = length(hKEGGdomains))
>    return(new("ipDataSource", genes = hKEGGgenes, pathways =
> hKEGGids, 
>        domains = hKEGGdomains, gene2Domains = gene2Domains, 
>        path2Domains = path2Domains, dims = dims, type = "KEGG"))
> }
> 
> sessionInfo()
> R version 2.11.0 Under development (unstable) (2009-10-06 r49948) 
> x86_64-unknown-linux-gnu 
> 
> locale:
> [1] C
> 
> attached base packages:
> [1] grid      stats     graphics  grDevices utils     datasets
> methods  
> [8] base     
> 
> other attached packages:
> [1] domainsignatures_1.7.0 biomaRt_2.3.0
> prada_1.23.0          
> [4] rrcov_1.0-00           pcaPP_1.7
> mvtnorm_0.9-8         
> [7] robustbase_0.5-0-1     RColorBrewer_1.0-2
> KEGG.db_2.3.5         
> [10] RSQLite_0.7-3          DBI_0.2-4
> AnnotationDbi_1.9.2   
> [13] Biobase_2.7.2         
> 
> loaded via a namespace (and not attached):
> [1] MASS_7.3-4    RCurl_1.3-0   XML_2.6-0     stats4_2.11.0
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list