[BioC] Hg18 build of org.Hs.eg.db

Tue Sep 21 19:31:08 CEST 2010

Hello everyone,

In general, you should reach for the GenomicFeatures package when you
need things like chromosomal positions, since it gives you the most
control over which build you are using.  The org packages are meant to
be "gene-centric" rather than "genome-centric", which may seem like a
fine distinction, but I think it's significant in this case.

The org packages do have some limited chromosomal position information,
but it is only there because people needed something a very long time
ago, long before we had better solutions (i.e. GenomicFeatures).  It is
still sometimes useful to have chromosomal positions in these packages,
so they have remained, but in most cases the org packages are really
meant for gene-centric annotations that can be attached directly to
something like an Entrez Gene ID: things like GO terms and UniGene IDs.

In part because the genomic ranges of transcripts and genes can vary
with the build, these kinds of annotations required a modular solution.
The basic notion is that you use the GenomicFeatures package to get the
appropriate genomic range information for your genome build of
interest, and then use the org package to attach additional
gene-centric information to the genes you found in those ranges.
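
To make the two-layer idea concrete, here is a minimal toy sketch in
Python.  It is not the Bioconductor API: the names RANGES, GENE_INFO,
GeneRange and annotate are hypothetical, and every gene ID, symbol and
coordinate is an invented placeholder.  It only shows the shape of the
join: ranges are keyed by build plus gene ID, gene-centric annotations
are keyed by gene ID alone, and the gene ID connects the two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRange:
    """One genomic range for a gene, valid only for the named build."""
    build: str
    chrom: str
    start: int
    end: int


# Build-specific layer (the role GenomicFeatures plays in the text):
# ranges keyed by (build, gene ID). All values are invented toy data.
RANGES = {
    ("hg18", "1001"): GeneRange("hg18", "chr1", 1_000, 5_000),
    ("hg19", "1001"): GeneRange("hg19", "chr1", 1_200, 5_200),
    ("hg18", "1002"): GeneRange("hg18", "chr2", 8_000, 9_500),
}

# Gene-centric layer (the role an org package plays in the text):
# build-independent annotations keyed by gene ID alone. Also toy data.
GENE_INFO = {
    "1001": {"symbol": "GENE_A"},
    "1002": {"symbol": "GENE_B"},
}


def annotate(build, gene_ids):
    """Look up ranges for one build, then attach gene-centric info by ID."""
    rows = []
    for gene_id in gene_ids:
        rng = RANGES.get((build, gene_id))
        if rng is None:
            continue  # this gene has no range recorded for this build
        info = GENE_INFO.get(gene_id, {})
        rows.append({
            "gene_id": gene_id,
            "build": rng.build,
            "chrom": rng.chrom,
            "start": rng.start,
            "end": rng.end,
            "symbol": info.get("symbol"),
        })
    return rows


if __name__ == "__main__":
    for row in annotate("hg18", ["1001", "1002"]):
        print(row)
```

Note that switching builds only changes the first lookup; the
gene-centric table is untouched, which is the point of keeping the two
kinds of annotation in separate places.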

Finally, for the CHRLOC and CHRLOCEND annotations in the org packages:
these are actually based on the UCSC genome builds.  The most recent
versions of these packages should specify the build being used, and
the manual page for each annotation mapping says where that mapping
comes from.  The CHR mapping, on the other hand, comes from NCBI.  It's
all documented in the manual pages, and has been this way for ages, but
I wouldn't blame you if you said this was somewhat confusing.  If we
were to add more of this kind of information to the org packages,
things could definitely get much more confusing.  This sort of issue is
another reason why we have stopped adding new genomic range annotations
to the org packages and instead created the GenomicFeatures package.

I hope this helps to clarify things somewhat.  Please let me know if I
left any questions unanswered.

  Marc

On 09/21/2010 09:55 AM, Sean Davis wrote:
> On Tue, Sep 21, 2010 at 12:43 PM, Vincent Carey
> <stvjc at channing.harvard.edu> wrote:
>
>   
>> On Tue, Sep 21, 2010 at 12:16 PM, Steve Lianoglou
>> <mailinglist.honeypot at gmail.com> wrote:
>>     
>>> Hi,
>>>
>>> On Tue, Sep 21, 2010 at 12:09 PM, Vincent Carey
>>> <stvjc at channing.harvard.edu> wrote:
>>>       
>>>> I believe the best way would be to install the "release" version of R
>>>> that coincides with the org.Hs.eg.db
>>>> version that you are interested in, and install it into that version
>>>> of R with biocLite.  In my lab we have
>>>> simultaneous availability of R 2.10, 2.11, and 2.12 (and now 2.13) to
>>>> allow continuity of ongoing analyses.
>>>> The user must pick the appropriate version.
>>>>         
>>> While this generally works, it's somewhat non-optimal for this
>>> particular problem, especially because the package in question is an
>>> annotation package.
>>>
>>> Consider the situation where the user wants to use Rsamtools (not
>>> available in R 2.10) but is performing analyses against hg18.
>>>       
>> Fair point.
>>
>>     
>>> One could argue that perhaps the GenomicFeatures package would be
>>> better suited to replace the genome-version-specific info I'm guessing
>>> the OP wanted from org.Hs.eg.db, but I'm just giving an example.
>>>       
>> Perhaps.  The GenomicFeatures approach does allow the user to specify the
>> build and makes the user responsible for maintaining that image of the
>> feature metadata -- it is not part of
>> a distributed package.  And if I am not mistaken, the way one will
>> most likely get richer metadata from the
>> makeTranscriptDb... result is by decoding the transcript names or
>> Entrez IDs used with an org.Hs*
>> table.
>>
>> If one really needs an hg18-based set of org.Hs* mappings, one might
>> be able to use sqlForge facilities
>> of the AnnotationDbi package to make it and install it alongside the
>> distributed org.Hs*, with a distinguishing
>> package name.  I say "might" here because the reliance of this
>> facility on possibly genome-build-dependent metadata
>> is not completely clear to me; Marc Carlson may want to comment.
>>
>>
>>     
> Just to add here that the org.Hs.eg.db packages are based on NCBI data,
> generally speaking.  The GenomicFeatures stuff is generally based on UCSC
> data (or some other source chosen by the user).  The annotations from these
> various sources will not be equivalent in all cases, particularly when it
> comes to chromosome positions.
>
> To make things even more complicated, since there have been numerous RefSeq
> releases from NCBI over the three years from 2006 to 2009, the org.Hs.eg.db
> packages during those years will generally NOT be the same from one release
> to the next even though the reference genome was unchanged.
>
>
>   
>> But I do believe one wants to avoid the temptation of taking an
>> "older" org.Hs* package and installing it in
>> a newer, mismatched version of R.  You might get away with it in some
>> circumstances but it is not a good
>> idea.  It is possible to create some simple interversion
>> communications so that answers to queries against an R 2.10-based
>> resource can be used in R 2.12 for example, and that would be much safer.
>>
>>
>>     
> This point cannot be overstated.  While in some cases this might work,
> designing workflows with these kinds of hacks in place is bound to lead to
> difficulties at some point (always 2 days before a grant deadline or a
> national talk).
>
>
>   
>>> -steve
>>>
>>> --
>>> Steve Lianoglou
>>> Graduate Student: Computational Systems Biology
>>>  | Memorial Sloan-Kettering Cancer Center
>>>  | Weill Medical College of Cornell University
>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>
>>>       
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>