[BioC] question about ontoCompare() performance change

Fri Nov 13 04:39:24 CET 2009

Seth,

Thank you for your analysis and the initial pass at a replacement
implementation.  Much appreciated.

Scott

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  smarkel at accelrys.com
Accelrys (SciTegic R&D)             mobile: +1 858 205 3653
10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Vice President, Board of Directors:
    International Society for Computational Biology
Chair: ISCB Publications Committee
Associate Editor: PLoS Computational Biology
Editorial Board: Briefings in Bioinformatics

-----Original Message-----
From: Seth Falcon [mailto:sfalcon at fhcrc.org] 
Sent: Thursday, 12 November 2009 1:44 PM
To: Scott Markel
Cc: bioconductor at stat.math.ethz.ch; Agnes Paquet
Subject: Re: [BioC] question about ontoCompare() performance change

Hi again,

On 10/29/09 10:26 AM, Seth Falcon wrote:
> Thanks for the reminder and providing a reproducible example. We will 
> take a look and see if we can understand and provide a fix for the 
> slow down.

The goTools::ontoCompare function as currently coded takes "the long way" at a couple of points when dealing with the GO annotation in the GO.db package.  Unfortunately, I don't see an easy way to make just a few small changes to the existing function.  I believe a significant refactoring is required.

To that end, I've attempted to understand the main goal of the ontoCompare function and to reproduce some of the functionality with a different coding approach.  My intention is to get things started, not to furnish a complete fix.  I have attached an R file containing functions for an alternate implementation.  Here's a summary:

## Start by timing an example with the current goTools code
library("goTools")
library("hgu133a.db")
data(probeID)

system.time(z0 <- ontoCompare(list(L1=affylist[[1]]), "hgu133a",
             method="none"))
Starting ontoCompare...
     user   system  elapsed
1280.047   21.033 1320.269

## Now time the alternate implementation
system.time(zz <- goCompare(affylist[[1]], "hgu133a"))
    user  system elapsed
  14.712   0.116  15.154
Warning message:
In probeToGO(probes, probeType, ontology) :
   removing 15 probe IDs with no mapping to GO

As you can see, the alternate is much faster: about 15 seconds elapsed versus about 1320 seconds, roughly an 87-fold speedup.  *However*, I haven't taken the time to completely re-implement the original function and, worse, I get slightly different results.  You can use the following to compare:

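## look up the term name for each GO ID in the new result
zz[["Term"]] <- sapply(zz$GO, function(x) Term(GOTERM[[x]]),
                       USE.NAMES=FALSE)

## terms reported by both implementations
inboth <- intersect(rownames(z0), zz$Term)

## copy the original counts over for the shared terms
zz[["OrigCount"]] <- as.integer(NA)
zz[match(inboth, zz$Term, nomatch=0L), "OrigCount"] <-
    as.integer(z0[inboth, ])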

zz[, c("Ontology", "Term", "OrigCount", "Count")]

    Ontology                                      Term OrigCount Count
1        MF                        molecular_function         3    76
19       CC                        cellular_component         2    76
34       BP                        biological_process         5    75
12       CC                                      cell        NA    74
13       CC                                 cell part        74    74
2        MF                                   binding        67    65
27       BP                          cellular process        58    58
21       CC                                 organelle        45    45
36       BP                         metabolic process        44    44
11       MF                        catalytic activity        38    38
23       BP                     biological regulation        12    31
40       BP          regulation of biological process        29    29
15       CC                            organelle part        24    24
44       BP                              localization        13    21
[snip]
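
For anyone who wants to try the comparison step without loading the annotation packages, here is a minimal sketch on made-up data.  Everything in it is an assumption for illustration: the toy z0 (a one-column matrix with term names as row names), the toy zz, the placeholder IDs, and the names onlyOld, onlyNew and differ are hypothetical and are not output from goTools or from the attached code.  It only shows one possible way to fill OrigCount with match() and then list the terms on which the two results disagree.

```r
## Toy stand-ins with made-up values (NOT real ontoCompare/goCompare output)
z0 <- matrix(c(5L, 3L, 7L), ncol = 1,
             dimnames = list(c("binding", "organelle", "old only term"),
                             "L1"))
zz <- data.frame(GO = c("id1", "id2", "id3"),   # placeholder IDs
                 Term = c("organelle", "binding", "new only term"),
                 Count = c(3L, 6L, 2L),
                 stringsAsFactors = FALSE)

## terms reported by both results
inboth <- intersect(rownames(z0), zz$Term)

## copy the old counts over, aligned by term name
zz[["OrigCount"]] <- NA_integer_
zz[match(inboth, zz$Term), "OrigCount"] <- as.integer(z0[inboth, ])

## terms reported by only one of the two results
onlyOld <- setdiff(rownames(z0), zz$Term)
onlyNew <- setdiff(zz$Term, rownames(z0))

## shared terms whose counts disagree
differ <- zz[!is.na(zz$OrigCount) & zz$OrigCount != zz$Count, ]
```

Splitting the output into onlyOld, onlyNew and differ may make it easier to see whether a discrepancy comes from a missing term or from a different count for the same term.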

I'm hoping that the attached code provides enough of a starting point for the package maintainer or other motivated party to work up a complete solution and understand the differences in the results.

+ seth

--
Seth Falcon
Program in Computational Biology | Fred Hutchinson Cancer Research Center