[BioC] How to get ENTREZID from Gene symbol in bioconductor

Thu Mar 13 00:26:59 CET 2008

Srinivas Iyyer wrote:
> Dear Marc, 
> thanks for the tip. 
>
> I obtained gene symbols from the hgu133plus2SYMBOL env
> (from probesets of u133plus2). 
>
> I do not have a data matrix for these genes. I have
> only a list of gene symbols. 
>
>
> Is there a way to juggle between SYMBOL <-> PROBEsetID
> <-> SYMBOL/ENTREZID/....and rest of all
> functionalities. 
>
>   
>> xx = mget(msba, revmap(org.Hs.egSYMBOL))
>>     
> Error in .checkKeys(value, Rkeys(x), x@ifnotfound) :
> invalid key "KNTC2"
>
> I get the error with org.Hs.egSYMBOL.
>
> Thanks
> Srini
>
>
>   

The error you are listing here just means that the symbol KNTC2 is not
in the environment you are searching.  Since you say you got the list of
genes from the hgu133plus2.db package, it makes me suspicious that your
packages are not all from the same time period.  Do you think you could
show your sessionInfo() for me?  If your annotation packages are all
from the same build, then the symbols that you get from hgu133plus2.db
should be found inside of the org.Hs.eg.db package.  Otherwise all bets
are off since these annotations necessarily change over time (which is
why we make a new set of builds every 6 months).
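
As a rough sketch of that check: the snippet below compares the installed
versions of the two annotation packages.  It assumes both packages are
installed, and it assumes (as in my sessionInfo() output in this message)
that packages from the same build carry the same version number.  That is
a heuristic, not a guarantee, and the names pkgs, versions and same_build
are only illustrative.

```r
# Annotation packages discussed in this thread.
pkgs <- c("hgu133plus2.db", "org.Hs.eg.db")

# Look up the installed version of each package (errors if one is missing).
versions <- vapply(pkgs,
                   function(p) as.character(packageVersion(p)),
                   character(1))
print(versions)

# Assumption: identical version numbers suggest the same annotation build.
same_build <- length(unique(versions)) == 1
print(same_build)
```

If same_build is FALSE, the "invalid key" error may simply reflect two
packages built from different snapshots of the annotation sources.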

Using recent annotation packages from devel, I don't find "official" (by
which I mean primary) gene symbols for KNTC2 in either package (or at
NCBI for humans).  This symbol is listed at NCBI only as an "alternate
symbol", which means you can only expect to get a value back if you use
the org.Hs.egALIAS2EG map.  That is because this map has all the
standard symbols plus all the alternate symbols within it.  In other
words this should work: 

mget("KNTC2", org.Hs.egALIAS2EG)
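
For a whole list of symbols, a minimal sketch might look like the
following.  It assumes org.Hs.eg.db is installed; the vector symbols is a
hypothetical stand-in for your own list (TP53 is added purely for
illustration).  Passing ifnotfound = NA makes mget() return NA for unknown
keys instead of stopping with the "invalid key" error.

```r
library(org.Hs.eg.db)  # also loads AnnotationDbi

# Hypothetical input: replace with your own character vector of symbols.
symbols <- c("KNTC2", "TP53")

# Look up every symbol in the alias map; unknown keys come back as NA
# rather than raising an error.
ids <- mget(symbols, org.Hs.egALIAS2EG, ifnotfound = NA)

# An alias can belong to more than one gene, so count the hits per symbol.
n_hits <- sapply(ids, function(x) sum(!is.na(x)))
ambiguous <- names(n_hits)[n_hits > 1]   # symbols with several Entrez IDs
unmapped  <- names(n_hits)[n_hits == 0]  # symbols with no Entrez ID at all
```

Symbols that land in ambiguous need a manual decision, since the map
alone cannot tell you which gene was meant.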

I am guessing that you have an older annotation package for hgu133plus2
that is from a time when KNTC2 was considered to be the primary gene
symbol for Entrez Gene ID 10403.  That would cause the error you are
reporting.  But this is all speculation without your sessionInfo(). 
Here is mine:

> sessionInfo()
R version 2.7.0 Under development (unstable) (2008-03-06 r44691)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C

attached base packages:
[1] tools     stats     graphics  grDevices datasets  utils     methods
[8] base

other attached packages:
[1] hgu133plus2.db_2.1.3 org.Hs.eg.db_2.1.3   AnnotationDbi_1.1.25
[4] RSQLite_0.6-8        DBI_0.2-4            Biobase_1.17.15

In general I would urge extreme caution when using gene symbols to map
to anything.  They are absolutely awful as identifiers since there is no
guarantee of uniqueness and they are prone to changing on the whims of
the people who coin them.  We have done what we can to make them
accessible, but please be careful when using gene symbols.  I am not
sure what exactly you are asking with your more general mapping
question, but the package hgu133plus2.db is really a "probe set centric"
package.  That means that everything in it maps (somehow) to a probeset
ID.  In contrast the org.Hs.eg.db package is really an "Entrez Gene
centric" package.
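
To make the two viewpoints concrete, here is a hedged sketch of one way to
go from a symbol to an Entrez Gene ID and from there to probe set IDs and
the current primary symbol.  It assumes both packages are installed and
come from the same build; the variable names (symbol, eg, probes, primary)
are only illustrative.

```r
library(org.Hs.eg.db)
library(hgu133plus2.db)

symbol <- "KNTC2"  # example symbol from this thread

# Symbol (primary or alias) -> Entrez Gene ID, using the gene-centric package.
eg <- unlist(mget(symbol, org.Hs.egALIAS2EG, ifnotfound = NA),
             use.names = FALSE)
eg <- eg[!is.na(eg)]
stopifnot(length(eg) > 0)  # stop here if the symbol is unknown

# Entrez Gene ID -> probe set IDs, by reversing the probe-set-centric map.
probes <- mget(eg, revmap(hgu133plus2ENTREZID), ifnotfound = NA)

# Entrez Gene ID -> current primary symbol.
primary <- mget(eg, org.Hs.egSYMBOL, ifnotfound = NA)
```

Routing every lookup through the Entrez Gene ID keeps the ambiguous
symbol step in one place.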

Hope this helps you,

    Marc