[BioC] GenBank RefSeq conversion

Sean Davis seandavi at gmail.com
Mon Jun 2 12:19:16 CEST 2008


On Mon, Jun 2, 2008 at 5:03 AM, Eleni Christodoulou <elenichri at gmail.com> wrote:
> Thank you guys,
>
> I saw your answers this morning. I downloaded the package "org.Hs.eg.db",
> but I am struggling a bit with the use of the commands. I am trying for
> example:
> x <- mget("AA868688",org.Hs.egACCNUM2EG)
> and I get the following error:
> Error in .checkKeys(value, Lkeys(x), x@ifnotfound) :
>   invalid key "AA868688"

This means that the org.Hs.egACCNUM2EG map has no entry for AA868688.
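
If you would rather get NA than an error for accessions with no entry, mget() accepts ifnotfound = NA. The sketch below uses a toy environment as a stand-in for org.Hs.egACCNUM2EG so that it runs without the annotation package; the key "ACC_KNOWN" and the ID "1001" are made up for illustration. As far as I know the same call works on the real map, but treat that as an assumption and check the AnnotationDbi documentation.

```r
# Toy stand-in for org.Hs.egACCNUM2EG (accession -> Entrez Gene ID).
# The entry below is a placeholder invented for illustration, not real annotation.
toy_map <- new.env()
assign("ACC_KNOWN", "1001", envir = toy_map)

keys <- c("ACC_KNOWN", "AA868688")

# ifnotfound = NA yields NA for keys without an entry instead of stopping
# with an "invalid key" error.
hits <- mget(keys, toy_map, ifnotfound = NA)

# Flag the keys for which nothing was found.
is_missing <- vapply(hits, function(v) all(is.na(v)), logical(1))

mapped   <- hits[!is_missing]       # accessions that have an entry
unmapped <- names(hits)[is_missing] # accessions to look up elsewhere
```

The unmapped vector then gives you the list of accessions (ESTs and the like) that need one of the other routes described in this message.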

> This happens with all the GenBank identifiers that I am trying to convert to
> Entrez Gene IDs. What am I doing wrong?

You are not doing anything wrong.  NCBI supplies accession-to-gene
mappings only for GenBank accessions that are essentially full-length
transcripts associated with a gene.  The accession above, however, is
an EST, and NCBI does not directly provide accession-to-gene
conversion for such non-full-length accessions.  So, you have a couple
of options:

1)  Use the Stanford SOURCE website to do the conversion for you.  It
will use UniGene mappings to do so.

2)  Build your own annotation package using SQLForge.  This option
will supply you with the mappings that you want in R and in the data
structure of the other annotation packages.

Hope that helps.

Sean

> On Fri, May 30, 2008 at 7:27 PM, Marc Carlson <mcarlson at fhcrc.org> wrote:
>>
>> Sean Davis wrote:
>>>
>>> On Fri, May 30, 2008 at 8:53 AM, Eleni Christodoulou
>>> <elenichri at gmail.com> wrote:
>>>
>>>>
>>>> Hello all!
>>>>
>>>> I was trying to convert RefSeq accession numbers to GenBank accession
>>>> numbers (or the opposite). I think that there must exist a library that
>>>> does this job automatically... Does anyone know anything relevant to this?
>>>>
>>>
>>> Hi, Eleni.  There is no direct relationship between RefSeq and GenBank
>>> numbers.  A given RefSeq may or may not be represented by exactly one
>>> GenBank accession.  In fact, a RefSeq may not represent any "real"
>>> sequence, but can be a composite of several "real" sequences.  As an
>>> example, see here:
>>>
>>> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NM_007294.2
>>>
>>> It looks like this RefSeq is actually composed of 4 different
>>> sequences from GenBank (if I am reading the record correctly).
>>>
>>> The only way I know to deal with this (at least in the general case)
>>> is to go through Entrez Gene (or the Ensembl equivalent of a gene) to
>>> find those accessions in GenBank and RefSeq that share a common Gene
>>> ID.  You can do this using the annotation package for the organism of
>>> interest, I think.  Steffen or others might be able to comment on how
>>> to do this using biomaRt.
>>>
>>> Sean
>>>
>>
>> What Sean mentioned should work to at least let you connect the dots.
>>
>> As an example, for human you could use the package "org.Hs.eg.db" and then
>> use the following mappings to get what you want:
>>
>> First, use "org.Hs.egACCNUM2EG" to get Entrez Gene IDs for your GenBank
>> accessions.
>>
>> Then use "org.Hs.egREFSEQ" to get RefSeq IDs for those Entrez Gene IDs.
>>
>>
>>   Marc
>
>
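
The two-step lookup that Marc describes in the quoted message (accession to Entrez Gene ID, then Entrez Gene ID to RefSeq) can be sketched roughly as follows. This is only an illustration under stated assumptions: the two environments are toy stand-ins for org.Hs.egACCNUM2EG and org.Hs.egREFSEQ, every key and value in them is invented, and accession_to_refseq() is a hypothetical helper, not part of any package.

```r
# Toy stand-ins for the two annotation maps; all entries are placeholders.
acc2eg <- new.env()                       # plays the role of org.Hs.egACCNUM2EG
assign("ACC_A", "1001", envir = acc2eg)
eg2refseq <- new.env()                    # plays the role of org.Hs.egREFSEQ
assign("1001", c("NM_TOY_1", "NP_TOY_1"), envir = eg2refseq)

# Hypothetical helper: chain the two lookups through the Entrez Gene ID.
accession_to_refseq <- function(accessions, acc2eg, eg2refseq) {
  # Step 1: accession -> Entrez Gene ID(s); NA where there is no entry.
  eg <- mget(accessions, acc2eg, ifnotfound = NA)
  lapply(eg, function(ids) {
    ids <- ids[!is.na(ids)]
    if (length(ids) == 0) return(NA_character_)
    # Step 2: Entrez Gene ID(s) -> RefSeq ID(s), pooled across gene IDs.
    unique(unlist(mget(ids, eg2refseq, ifnotfound = NA), use.names = FALSE))
  })
}

result <- accession_to_refseq(c("ACC_A", "AA868688"), acc2eg, eg2refseq)
```

Accessions with no entry in the first map, such as the EST discussed in this thread, come back as NA rather than raising an error. With org.Hs.eg.db loaded, passing the real maps in place of the toy environments should behave the same way, though I have not verified that here.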


