[BioC] building a new annotation

Hervé Pagès hpages at fhcrc.org
Fri Mar 13 04:47:07 CET 2009

Hi Chirag,

You don't give us a lot of details about the schema of your database.
You just say there is a main table. That suggests there are other tables,
but how do they relate to the main table? How tables relate to each
other is very important and is what we've formalized with the notion of
L2Rchain (left-to-right chain) and L2Rlink objects in AnnotationDbi.
The idea is that any map we define in our .db packages can be
described with an (L2R) chain of (L2R) links. This is a high-level
description of the map. The advantages of such a description are:

   (a) Defining a map doesn't require you to write any SQL statement.
       The SQL code is automatically generated from the high-level
       description when the user queries the map (with mget(),
       keys(), get(), etc...).

   (b) It's easy to define new maps.

   (c) Some operations/transformations on maps are easier to do at the
       high level (e.g. adding/modifying a filter, plugging maps together,
       etc.). Unfortunately those operations are not available yet.

So what's an L2R chain? Here is an imaginary database:

           table1    table2    table3
           ------    ------    ------
           col1a     col2a     col3a
           col1b     col2b     col3b
           col1c     col2c
                     col2d

A map can be seen as a path that goes from any column in the db (e.g.
table1.col1c) to any other column in the db (e.g. table2.col2b).
The L2R chain describes the path that must be followed to go from
table1.col1c (the leftmost col of the map) to table2.col2b (the
rightmost col of the map). This path is described with 1 or more
L2R links. For example, a map from table1.col1c to table2.col2b (call
it mapA) could be described with 3 links:

   1st link: table1.col1c -> table1.col1a
   2nd link: table3.col3a -> table3.col3b
   3rd link: table2.col2d -> table2.col2b

Note that the left and right columns of a given link always belong
to the same table. The simplest kind of map is mapping 2 columns
of the same table and is described with just 1 link. To define this
kind of map, just use Marc's createSimpleBimap() function.

But what happens between links? What does it mean that link 1
[table1.col1c -> table1.col1a] is followed by link 2
[table3.col3a -> table3.col3b]? It means that columns table1.col1a
and table3.col3a are in relation, i.e. that the values they contain
are of the same type and refer to the same entities. Most of the
time, this will appear explicitly in the SQL schema: there will be
a foreign key between the 2 columns, but not always. Also, most of the
time, the 2 columns in relation will have the same name, but not always.
In the end it's up to you to decide whether it makes sense or not to
put 2 columns in relation.
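
A quick and admittedly crude way to sanity-check a candidate relation
is to look at how much the values of the 2 columns overlap. Here is a
minimal sketch in plain R. The 2 vectors are made-up stand-ins for
table1.col1a and table3.col3a (in practice you would pull them from the
db), and the overlap test is just a heuristic, not something
AnnotationDbi does for you:

```r
# Toy values standing in for table1.col1a and table3.col3a (made up).
col1a <- c(1L, 2L, 3L)
col3a <- c(1L, 2L, 2L)

# Do both columns use the same storage type?
same_type <- identical(typeof(col1a), typeof(col3a))

# Which left values also occur on the right, and what fraction
# of the distinct left values is that?
shared   <- intersect(col1a, col3a)
coverage <- mean(unique(col1a) %in% col3a)
```

A low coverage doesn't prove the relation is wrong (some entities may
simply have no annotation), but no overlap at all is a strong hint that
the 2 columns don't refer to the same entities.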

When it's time to extract data from the map, each relation between 2
links will be translated into an SQL join. For example, when extracting
all the data from mapA (with 'toTable()' or 'as.list()'), an SQL statement
will be generated that will more or less look like this:

SELECT table1.col1c,table2.col2b FROM table1
   INNER JOIN table3 ON table1.col1a=table3.col3a
   INNER JOIN table2 ON table3.col3b=table2.col2d;

(In practice, things are a little bit more complicated. To see exactly
what's generated, turn on SQL debugging mode with AnnotationDbi:::debugSQL())
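
To make the join concrete, here is a small sketch that builds a toy
version of the imaginary database in memory and runs that kind of
statement by hand with DBI and RSQLite. All the values are made up, only
the columns used by the 3 links are filled in for table2, and the
ORDER BY clause is added just to get a stable row order. This is not the
code AnnotationDbi runs internally, only an illustration of what the
chain boils down to:

```r
library(DBI)
library(RSQLite)

# In-memory toy version of the imaginary database (all values made up).
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "table1", data.frame(
  col1a = c(1L, 2L, 3L),
  col1b = c("x", "y", "z"),
  col1c = c("k1", "k2", "k3"),
  stringsAsFactors = FALSE))
dbWriteTable(con, "table3", data.frame(
  col3a = c(1L, 2L, 2L),
  col3b = c(10L, 20L, 30L)))
dbWriteTable(con, "table2", data.frame(
  col2b = c("v10", "v20", "v30"),
  col2d = c(10L, 20L, 30L),
  stringsAsFactors = FALSE))

# One INNER JOIN per relation between 2 consecutive links.
sql <- paste(
  "SELECT table1.col1c, table2.col2b FROM table1",
  "INNER JOIN table3 ON table1.col1a = table3.col3a",
  "INNER JOIN table2 ON table3.col3b = table2.col2d",
  "ORDER BY table1.col1c, table2.col2b")
mapA <- dbGetQuery(con, sql)
dbDisconnect(con)

# Group the right values by left key, similar in spirit to a list view.
mapA_list <- split(mapA$col2b, mapA$col1c)
```

With these toy values, k2 ends up mapped to 2 values (the map is
one-to-many) and k3 drops out of the join result because its col1a value
has no match in table3.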

If you look at the hgu95av2ENZYME map in hgu95av2.db:

 > str(hgu95av2ENZYME)
Formal class 'AnnDbBimap' [package "AnnotationDbi"] with 8 slots
   ..@ L2Rchain  :List of 2
   .. ..$ :Formal class 'L2Rlink' [package "AnnotationDbi"] with 8 slots
   .. .. .. ..@ tablename   : chr "probes"
   .. .. .. ..@ Lcolname    : chr "probe_id"
   .. .. .. ..@ tagname     : chr NA
   .. .. .. ..@ Rcolname    : chr "_id"
   .. .. .. ..@ Rattribnames: chr(0)
   .. .. .. ..@ Rattrib_join: chr NA
   .. .. .. ..@ filter      : chr "1"
   .. .. .. ..@ altDB       : chr(0)
   .. ..$ :Formal class 'L2Rlink' [package "AnnotationDbi"] with 8 slots
   .. .. .. ..@ tablename   : chr "ec"
   .. .. .. ..@ Lcolname    : chr "_id"
   .. .. .. ..@ tagname     : chr NA
   .. .. .. ..@ Rcolname    : chr "ec_number"
   .. .. .. ..@ Rattribnames: chr(0)
   .. .. .. ..@ Rattrib_join: chr NA
   .. .. .. ..@ filter      : chr "1"
   .. .. .. ..@ altDB       : chr(0)
   ..@ direction : int 1
   ..@ Lkeys     : chr NA
   ..@ Rkeys     : chr NA
   ..@ ifnotfound: list()
   ..@ datacache :<environment: 0x2413308>
   ..@ objName   : chr "ENZYME"
   ..@ objTarget : chr "chip hgu95av2"

You can see it has 2 links:

   [probes.probe_id -> probes._id]
   [ec._id -> ec.ec_number]

The probes._id and ec._id columns both contain internal gene ids i.e.
arbitrary integers that we use within the scope of the hgu95av2.db
package to uniquely refer to genes (the mapping between this internal
id and the real Entrez id is stored in the 'genes' table). So the 2
columns are naturally in relation.

Most maps in hgu95av2.db are made of two L2R links. But hgu95av2ACCNUM
for example is made of one link only.
Some maps in GO.db are made of 3 links where the leftmost and rightmost
columns belong to the same table (but the path between them goes thru
another table).

Look at the R/createAnnObjs.*_DB.R files in AnnotationDbi: they contain
the L2Rchain/L2Rlink description of all the predefined maps that you
find in our .db packages. For example createAnnObjs.HUMANCHIP_DB.R
contains the definition of all the maps found in hgu95av2.db (and any
other .db package based on the HUMANCHIP_DB schema, use
'dbmeta(hgu95av2_dbconn(), "DBSCHEMA")' to get the name of the
underlying db schema).

Those map definitions are stored in the HUMANCHIP_DB_AnnDbBimap_seeds
object (list of lists of etc... there are many nested levels). You'll
need to reproduce something like this in your own annotation package
and then call AnnotationDbi:::createAnnDbBimaps() on it to create the
maps. Look at the code for the details. There are a lot of details to
take care of but I can't cover them all here.
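
If full-blown maps turn out to be more than you need, the alternative
Marc mentions in the message quoted below (plain accessor functions on
top of DBI/RSQLite) could look roughly like this. It is only a sketch
under assumptions: the table name gene_chem, the function name
chemicalsForGenes() and all the ids are hypothetical; only the 4 column
names come from your description of the main table:

```r
library(DBI)
library(RSQLite)

# Hypothetical stand-in for the CTD-derived SQLite file. The table name
# and every id below are placeholders, not real CTD data.
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "gene_chem", data.frame(
  entrez_gene_id     = c("1001", "1001", "1002"),
  chemical_id        = c("CHEM_A", "CHEM_B", "CHEM_A"),
  relation_id        = 1:3,
  pubmed_citation_id = c("p1", "p2", "p3"),
  stringsAsFactors = FALSE))

# Accessor: chemicals related to each requested gene, returned as a
# named list (NA for genes that are not in the table).
chemicalsForGenes <- function(con, gene_ids) {
  one <- function(id) {
    res <- dbGetQuery(con,
      "SELECT DISTINCT chemical_id FROM gene_chem
       WHERE entrez_gene_id = ? ORDER BY chemical_id",
      params = list(id))
    if (nrow(res) == 0L) NA_character_ else res$chemical_id
  }
  setNames(lapply(gene_ids, one), gene_ids)
}

res <- chemicalsForGenes(con, c("1001", "9999"))
dbDisconnect(con)
```

The return value is shaped like what mget() gives you on a map (a named
list with NA for unknown keys), so code written against it would not
have to change much if you define real maps later.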

Hope this gets you started. Let us know if you need further help.


Marc Carlson wrote:
> Hi Chirag,
> createSimpleBimap is really meant for the case where someone is using a
> custom annotation package that they have generated using SQLForge (you
> don't want to do that), and they have added a single table which
> contains all the information that they wish to represent.  In this very
> simple case, createSimpleBimap() will add a mapping to your package. 
> But otherwise you will probably want to have a look at (as an example)
> the createAnnObjs.HUMANCHIP_DB.R in the AnnotationDbi package, and also
> at the zzz.R inside the hgu95av2.db package for an example of how these
> mappings can be set up.  If you look at these examples you will see some
> L2Rchains being used to define the set of mappings needed for a package.
> Please keep the conversation "on list" so that others can benefit from
> your questions.  And while we are on that topic, this conversation would
> probably be a better fit on the bioc-devel mailing list than here,
> because you are really talking about defining a new set of interfaces
> for interacting with a completely different SQLite database schema than
> anything else we support.  And actually, you really might not need to
> make a set of mappings at all.  You might instead just want to write
> some simple functions to retrieve pertinent data from the database.  I
> still don't know which of the data in this database you want to use or
> what you want to do with it, so it's difficult for me to really advise
> you on what is more appropriate at this time.
>   Marc
> Chirag Patel wrote:
>> Marc,
>> Thanks so much for your response... AnnotationDbi may be the way to go
>> for me.
>> I have a couple more questions.  I am working through the
>> vignette, and I am having trouble understanding how the objects are
>> mapped to the underlying db.  How exactly do we create these objects? 
>> I am guessing that I should start with 'createSimpleBimap'.
>> For example, if we use the example of the affy annotation db,
>> "hgu95av2.db", we have the bimap objects hgu95av2ACCNUM,
>> hgu95av2ALIAS2PROBE, etc...
>> How do we specify these objects?
>> And what is the 'L2Rchain' structure you talk about below?
>> Thanks,
>> Chirag
>> On Mar 12, 2009, at 10:38 AM, Marc Carlson wrote:
>>> Hi Chirag,
>>> If you are building this on a custom database that you already have in
>>> hand then you cannot use SQLForge because that will try to make a
>>> customized database for you.  And AnnBuilder is gone now (and would not
>>> have helped you here anyways).  Instead, you might want to look closely
>>> at the code in AnnotationDbi which defines several types of databases
>>> along with the mappings to represent the underlying DB data in R using
>>> an L2Rchain structure.  We plan to make these structures easier to
>>> use outside of AnnotationDbi in the future.
>>> Alternatively (depending completely upon what kind of access you want
>>> to provide to your users), you could also pretty easily just write some
>>> simple accessors to talk to this database.  Direct access to SQLite
>>> databases is pretty straightforward from R using the RSQLite and DBI
>>> packages.  The AnnotationDbi vignette has some examples of this direct
>>> style of access; you can find it here:
>>> http://www.bioconductor.org/packages/devel/bioc/html/AnnotationDbi.html
>>> If you have further questions please let me know,
>>>  Marc
>>> Chirag Patel wrote:
>>>> Hello,
>>>> I would like to build a new annotation using data from the CTD
>>>> (http://ctd.mdibl.org).
>>>> The data are in an SQLite DB whose main table has the schema:
>>>> entrez_gene_id, chemical_id, relation_id, and pubmed_citation_id.
>>>> Relation_id is an internal id I use to manage relations between the
>>>> chemicals and genes.  Chemical_id is an id used by the CTD to identify
>>>> chemicals.
>>>> How may I best do this using the tools available on Bioconductor? I
>>>> was thinking of using AnnBuilder or AnnotationDbi, but am unsure if
>>>> this is the right way to go; this is my first time building a package
>>>> or an annotation.
>>>> Any help would be much appreciated,
>>>> Chirag
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor

Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
