[BioC] question about ontoCompare() performance change

Seth Falcon sfalcon at fhcrc.org
Thu Nov 12 22:43:43 CET 2009


Hi again,

On 10/29/09 10:26 AM, Seth Falcon wrote:
> Thanks for the reminder and providing a reproducible example. We will
> take a look and see if we can understand and provide a fix for the slow
> down.

The goTools::ontoCompare function, as currently coded, takes "the long
way" at a couple of points when dealing with the GO annotation in the
GO.db package.  Unfortunately, I don't see an easy way to fix this with
just a few small changes to the existing function; I believe a
significant refactoring is required.

To that end, I've attempted to understand the main goal of the 
ontoCompare function and to reproduce some of the functionality with a 
different coding approach.  My intention is to get things started, not 
to furnish a complete fix.  I have attached an R file containing 
functions for an alternate implementation.  Here's a summary:

## start out by executing a sample with current goTools code
library("goTools")
library("hgu133a.db")
data(probeID)

system.time(z0 <- ontoCompare(list(L1=affylist[[1]]), "hgu133a",
             method="none"))
Starting ontoCompare...
     user   system  elapsed
1280.047   21.033 1320.269

## Now demonstrate the alternate implementation
system.time(zz <- goCompare(affylist[[1]], "hgu133a"))
    user  system elapsed
  14.712   0.116  15.154
Warning message:
In probeToGO(probes, probeType, ontology) :
   removing 15 probe IDs with no mapping to GO

As you can see, the alternate is much faster: about 15 seconds elapsed
versus about 1320, roughly an 87-fold speedup.  *However*, I haven't
taken the time to completely re-implement the original function and,
worse, I get slightly different results.  You can use the following to
compare:

zz[["Term"]] = sapply(zz$GO, function(x) Term(GOTERM[[x]]),
                       USE.NAMES=FALSE)

inboth <- intersect(rownames(z0), zz$Term)

zz[["OrigCount"]] <- as.integer(NA)

## keep the assignment operator on the first line so R parses this
## as a single statement
zz[match(inboth, zz$Term, nomatch=0L), "OrigCount"] <-
    as.integer(z0[inboth, ])

zz[, c("Ontology", "Term", "OrigCount", "Count")]

    Ontology                                      Term OrigCount Count
1        MF                        molecular_function         3    76
19       CC                        cellular_component         2    76
34       BP                        biological_process         5    75
12       CC                                      cell        NA    74
13       CC                                 cell part        74    74
2        MF                                   binding        67    65
27       BP                          cellular process        58    58
21       CC                                 organelle        45    45
36       BP                         metabolic process        44    44
11       MF                        catalytic activity        38    38
23       BP                     biological regulation        12    31
40       BP          regulation of biological process        29    29
15       CC                            organelle part        24    24
44       BP                              localization        13    21
[snip]
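
To help track down where the two sets of counts disagree, something
like the following might be a useful next step.  It is only a sketch:
zz_demo is a hypothetical stand-in that copies the first six rows of
the table above, so that the example runs without goTools or the
attached file.  With the real data you would work on zz directly.

```r
## Hypothetical stand-in for zz: the first six rows of the table above
zz_demo <- data.frame(
    Ontology  = c("MF", "CC", "BP", "CC", "CC", "MF"),
    Term      = c("molecular_function", "cellular_component",
                  "biological_process", "cell", "cell part", "binding"),
    OrigCount = c(3L, 2L, 5L, NA, 74L, 67L),
    Count     = c(76L, 76L, 75L, 74L, 74L, 65L),
    stringsAsFactors = FALSE
)

## Terms that only the alternate implementation reports
onlyNew <- zz_demo[is.na(zz_demo$OrigCount), ]

## Terms reported by both implementations, but with different counts
differ <- zz_demo[!is.na(zz_demo$OrigCount) &
                  zz_demo$OrigCount != zz_demo$Count, ]
differ$Diff <- differ$Count - differ$OrigCount

## Largest disagreements first
differ <- differ[order(-abs(differ$Diff)), ]
print(onlyNew)
print(differ)
```

In these six rows the largest gaps are at the three ontology root
terms, so that may be a sensible place to start comparing the two
approaches.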

I'm hoping that the attached code provides enough of a starting point 
for the package maintainer or other motivated party to work up a 
complete solution and understand the differences in the results.

+ seth

-- 
Seth Falcon
Program in Computational Biology | Fred Hutchinson Cancer Research Center
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: probeToGO.R
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20091112/e24e1547/attachment.pl>

