[BioC] "topGOdata" object: How to supply gene scores with a predefined list of genes

Mon Aug 31 19:19:02 CEST 2009

Hi Scott,

sorry for the late reply. I hope you are still interested in the
problem you raised.

Your problem is very interested and I must admit I didn't thought
about it when I wrote the topGO code. And as you already mentioned
there is no trivial way to use more than one score in the gene
selection function. However, one could do this with a bit of
hacking... and I see two solutions to your problem.

1) If you want to be able to compare the results for a KS test with
the ones from some count based test like Fisher's exact test, and for
the count based tests you want to use a section criteria involving
more than one gene score, then the most easiest way to do it is to
build two "topGOdata" objects, one for the KS test, thus containing
only the gene score and a dummy selection function and the other one
containing a predefined list of interesting genes. Note that once a
"topGOdata" object is instantiated, one can update the gene score or
the predefined list of genes via the "updateGenes" method. Also very
important to note here is that you need to have the same gene universe
for the same objects. It will be a bit strange to compare results
which are obtained using a different gene universe. Once you have the
results, you can use the "GenTable" function to generate a summary of
the results. You could feed any of the two "topGOdata" objects to this
function (the "topGOresult" object do not directly depend on it).

The disadvantage of this approach is the redundancy in the two
"topGOdata" objects. However, it should be strait forward to use.

2) If you plan to use KS test and cont based tests, then one can do
the following trick. The main observation is that the KS test used
here is a ranked based test and thus only the ranking of the genes,
based on their p-value, counts. Therefore, one can transform the
vector of p-values into a vector of ranks, before building the
"topGOdata" object. Now, since the ranks are unique (and we can
compute them in this way), we can use them as to index a vector/table.
We can build a table, similar with "input" in your first example, in
which we can store multiple information on the genes. Then, if you
build the "topGOdata" object using these ranks as scores, one could
compute of course the KS test, and also define a selection function
which will have access the data in the table using the ranks as
indices. Of course, the table in this case needs to be a global
variable. Once the "topGOdata" object is build it is very easy to
update it with a new gene selection function using the
"geneSelectionFun<-" method. Note that this method works only with
ranked based test statistics and it will not generalize to tests which
use the score itself easily! Bellow, you will find a chunk of code
which should exemplify the second approach on the ALL data.

I hope you'll find the answer in one of the solutions describe. Please
let me know if you have further questions.

Regards,
Adrian

------------------------------------------------------------------------------

library(topGO)

library(ALL); data(ALL)
affyLib <- paste(annotation(ALL), "db", sep = ".")
library(package = affyLib, character.only = TRUE)

library(genefilter)
f1 <- pOverA(0.25, log2(100))
f2 <- function(x) (IQR(x) > 0.5)
ff <- filterfun(f1, f2)

## select genes based on the defined filters
selGenes <- genefilter(ALL, ff)

## obtain the list of differentially expressed genes
classLabel <- as.integer(sapply(ALL$BT, function(x) return(substr(x,
1, 1) == 'T')))
geneList <- getPvalues(exprs(ALL), classlabel = classLabel)

## select 1000 random genes
randSel <- integer(length(geneList))
randSel[sample(length(geneList), 1000)] <- as.integer(1)
names(randSel) <- names(geneList)

## transform the p-values into ranks
scoreIndex <- rank(geneList, ties.method = "first")

## generate a data frame containing all the information you need to
select interesting genes
GENE.INFO <- data.frame(p.value = geneList[names(geneList)],
                        filtered = selGenes[names(geneList)],
                        random = randSel[names(geneList)])

## index the data frame with the ranks
GENE.INFO <- GENE.INFO[order(scoreIndex), ]

## define a function to select the "significant" genes
mySelGenes <- function(allScore) {
  return(GENE.INFO[allScore, "p.value"] < 0.01)
}

## select the genes that passed the filter criteria and have a p-value
lower than 0.01
mySelGenes.2 <- function(allScore) {
  return(GENE.INFO[allScore, "p.value"] < 0.01 & GENE.INFO[allScore,
"filtered"])
}

## how many differentially expressed genes are:
sum(mySelGenes(scoreIndex))

sum(mySelGenes.2(scoreIndex))

## build the topGOdata class
GOdata <- new("topGOdata",
              ontology = "BP",
              allGenes = scoreIndex,
              geneSel = mySelGenes,
              nodeSize = 10,
              annot = annFUN.db,
              affyLib = affyLib)

## display the GOdata object
GOdata

## we can change the selection criteria, by updating the object with
mySelGenes.2
geneSelectionFun(GOdata) <- mySelGenes.2
## same gene universe but different number of interesting genes ...
GOdata

## we can use both KS and Fisher's exact test on this object:
geneSelectionFun(GOdata) <- mySelGenes
resFisher <- runTest(GOdata, algorithm = "classic", statistic = "fisher")
resKS <- runTest(GOdata, algorithm = "classic", statistic = "KS")

## Fisher's exact test accounting for the gene filter
geneSelectionFun(GOdata) <- mySelGenes.2
resFisher.2 <- runTest(GOdata, algorithm = "classic", statistic = "fisher")

## we can see some summary for each result
resFisher
resFisher.2
resKS

## we can generate the table which summarises the results
## but the stats. in the table will depend on the current selection
function stored in the GOdata object.
allRes <- GenTable(GOdata, classic = resFisher, KS = resKS, classic.2
= resFisher.2,
                   orderBy = "classic.2", ranksOf = "classic", topNodes = 20)

allRes

------------------------------------------------------------------------------

R version 2.9.0 (2009-04-17)
i686-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] multtest_2.1.1      genefilter_1.24.2   hgu95av2.db_2.2.12
 [4] ALL_1.4.5           topGO_1.12.0        SparseM_0.80
 [7] GO.db_2.2.11        RSQLite_0.7-2       DBI_0.2-4
[10] AnnotationDbi_1.6.1 Biobase_2.4.1       graph_1.22.2

loaded via a namespace (and not attached):
[1] annotate_1.22.0 grid_2.9.0      lattice_0.17-25 MASS_7.2-48
[5] splines_2.9.0   survival_2.35-4 tools_2.9.0     xtable_1.5-5