[BioC] mouse4302.db vs mouse 4302

James W. MacDonald jmacdon at med.umich.edu
Thu Apr 17 17:38:33 CEST 2008


Hi Gabor,

Gábor Csárdi wrote:
> Dear users & developers,
> 
> I'm sure that there were discussions about the pros and cons of switching to
> .db packages before the actual decision was made. Personally I don't really
> understand why it is not possible to have the old style packages. If anyone
> can point out the pros of having the .db packages to me, or just forward me
> to the relevant arguments (assuming they are public), I would be really
> grateful.

It's not possible to have the old style packages because the SQLite 
packages are better, and it is entirely too much work to maintain 
duplicate annotation packages plus the code base to make both sets.

> 
> For me, environment-based packages are much better, because they're faster,
> and I have enough RAM to load everything at the same time. Here is a little
> demonstration:
> 
> > library(org.Hs.eg.db)
> > egGO <- new.env(hash=TRUE)
> > all <- mget( ls(org.Hs.egGO), org.Hs.egGO )
> > for (i in seq(all)) {assign(names(all)[i], all[[i]], envir=egGO) }
> > sam <- sample(ls(org.Hs.egGO), 5000)
> > system.time( tmp <- mget(sam, org.Hs.egGO) )
>    user  system elapsed
>   0.808   0.001   0.811
> > system.time( tmp <- mget(sam, egGO) )
>    user  system elapsed
>   0.022   0.000   0.023
> 
> You could argue that the slower query is still under one second, and it does
> not matter, but it actually does matter if you have lots of queries. For some
> of my scripts the difference is something like 1 day against 10 minutes.
> E.g. GO enrichment calculation can be sped up about 100 times if one uses
> the environment-based packages and tailors the original hyperGTest for GO
> a bit.

In this example the environment-based packages are faster, but they are 
very limiting. For instance, they are only good for one-to-one or 
one-to-many relationships. If you want the reverse mapping of, say, a 
GO term to Entrez Gene IDs, this is difficult to do programmatically 
(whereas this is simple with the SQLite packages, using revmap()). In 
addition, the environment-based packages don't allow you to do queries 
that use two environments at once (e.g., you can't get GO terms for a 
set of Affy IDs without doing sequential queries, whereas the SQLite 
packages can do this in a single query).
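
To make the reverse-mapping point concrete, here is a minimal base R 
sketch. The environment and every ID in it are made up for illustration 
(they do not come from any annotation package). Inverting an environment 
by hand means dumping it to a list and regrouping, which is roughly the 
work revmap() saves you with the SQLite packages.

```r
# Toy one-to-many mapping: gene ID -> GO IDs (all IDs are hypothetical)
gene2go <- new.env(hash = TRUE)
assign("g1", c("GO:A", "GO:B"), envir = gene2go)
assign("g2", "GO:B", envir = gene2go)
assign("g3", c("GO:A", "GO:C"), envir = gene2go)

# Forward lookup is what environments are good at
mget(c("g1", "g3"), envir = gene2go)

# Reverse lookup: dump to a list, flatten to (gene, GO) pairs, regroup by GO ID
fwd   <- as.list(gene2go)
genes <- rep(names(fwd), lengths(fwd))   # one gene ID per (gene, GO) pair
gos   <- unlist(fwd, use.names = FALSE)  # the matching GO IDs
go2gene <- split(genes, gos)

go2gene[["GO:B"]]   # the genes annotated to GO:B
```

Note that the whole environment has to be pulled into a list before 
anything can be regrouped, so the cost grows with the size of the map 
rather than with the number of IDs you actually asked about.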

Additionally, just because you happen to have the compute power and RAM 
to use a large environment doesn't mean that the average BioC user does.

If you want the SQLite-based packages to be close to the same speed as 
the environment packages, you probably won't be able to use the 
convenience functions that were written to allow naive users to get 
similar results with the SQLite packages. Things like get(), mget(), etc. 
will probably always be slower than a directed SQL query. Compare the 
old GO package, GO.db with mget(), and GO.db with a direct query:

 > library(GO)
 > sam <- sample(ls(GOTERM), 5000)
 > system.time(mget(sam, GOTERM))
    user  system elapsed
    0.02    0.00    0.02

 > library(GO.db)

 > sam <- sample(ls(GOTERM), 5000)
 > system.time( mget(sam, GOTERM))
    user  system elapsed
    5.25    0.02    5.27

 > samvec <- paste(paste("'", sam, "'", sep = ""), collapse = ",")
 > sql <- paste("SELECT term FROM go_term WHERE go_id IN (",samvec,");")
 > system.time(dbGetQuery(GO_dbconn(), sql))
    user  system elapsed
    0.09    0.01    0.11
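
The string pasting above can be wrapped in a small helper. This is only 
a sketch: in_query() is a name I made up, and the table and column 
defaults are simply the ones used in the query above. It doubles any 
embedded single quotes so that an odd ID cannot break the statement.

```r
# Hypothetical helper: build
#   SELECT <col> FROM <table> WHERE <key> IN ('id1','id2',...);
in_query <- function(ids, col = "term", table = "go_term", key = "go_id") {
  stopifnot(is.character(ids), length(ids) > 0)
  # Escape single quotes the SQL way (' becomes ''), then wrap each ID in quotes
  quoted <- paste0("'", gsub("'", "''", ids, fixed = TRUE), "'")
  sprintf("SELECT %s FROM %s WHERE %s IN (%s);",
          col, table, key, paste(quoted, collapse = ","))
}

sql <- in_query(c("GO:0000001", "GO:0000002"))
print(sql)
# With GO.db loaded, the string would be run as in the session above:
# dbGetQuery(GO_dbconn(), sql)
```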

See the vignette in the devel version of AnnotationDbi for more 
information about these packages.

Best,

Jim



> 
> It is interesting to see that in R, the direction was "storing everything in
> memory" instead of keeping the data on the disk, as many other statistical
> software packages do; here, however, the direction is the opposite. You can
> also argue that these data packages are getting bigger and bigger each year,
> but so does the size of the RAM in the computers. For me it is not obvious
> which curve is steeper.
> 
> Finally, it might happen that the .db packages are just as fast as the old
> ones, only I'm not using them the right way. If this is the case, then I
> apologize; please show me the right way.
> 
> Best Regards,
> Gabor
> 
> On Wed, Apr 16, 2008 at 11:36 PM, Marc Carlson <mcarlson at fhcrc.org> wrote:
> 
>> Hi Richard,
>>
>> In general you want to use the .db version of these packages as it is
>> the newer format.
>>
>> The older format is really only provided for people who have lots of
>> old-style package dependencies that they still need to work out.  In a
>> couple of weeks, we should have a new version of these packages for you
>> to explore.  But the older format will be deprecated which means we are
>> not planning to make it again after the upcoming release...
>>
>>
>>    Marc
>>
>>
>>
>>
>> Richard Friedman wrote:
>>> Dear bioconductor list,
>>>
>>>       The mouse4302.db annotation package is listed as Version 2.02 of the
>>> annotation for mouse 4302 and was packaged Mon Oct 22 10:15:57.
>>>
>>>       The mouse4302 annotation package is listed as Version 2.01 and was
>>> packaged Thu Oct 11 11:31:47 2007.
>>>
>>>       Should I therefore use mouse4302.db as the more recent and
>>> authoritative of the 2 packages? Or is it simply one with different
>>> contents?
>>>
>>>       I realize that the answer to this question may seem pretty obvious,
>>> but I am wondering why mouse4302.db has a different name and not simply
>>> the same name and a different version number. So I thought I would ask to
>>> make sure.
>>>
>>> Thanks and best wishes,
>>> Rich
>>> -----------------------------------------------------------
>>> Richard A. Friedman, PhD
>>> Associate Research Scientist,
>>> Biomedical Informatics Shared Resource
>>> Herbert Irving Comprehensive Cancer Center (HICCC)
>>> Lecturer,
>>> Department of Biomedical Informatics (DBMI)
>>> Educational Coordinator,
>>> Center for Computational Biology and Bioinformatics (C2B2)/
>>> National Center for Multiscale Analysis of Genomic Networks (MAGNet)
>>> Box 95, Room 130BB or P&S 1-420C
>>> Columbia University Medical Center
>>> 630 W. 168th St.
>>> New York, NY 10032
>>> (212)305-6901 (5-6901) (voice)
>>> friedman at cancercenter.columbia.edu
>>> http://cancercenter.columbia.edu/~friedman/
>>>
>>> In Memoriam,
>>> Arthur C. Clarke
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623


