[BioC] Help me understand org.Hs.eg.db

Mon Apr 6 18:46:13 CEST 2009

Hi guys,

toTable() is designed to give a different result from mappedRkeys()
and mappedLkeys(). toTable() just puts the whole mapping into table
form, one row per (left key, right key) pair, while the "mapped(L|R)keys"
functions return each mapped (left or right) key only once. That is why
the 39824 rows from toTable() shrink to 39800 distinct symbols: 24
symbols each map to more than one Entrez Gene ID. As Christof pointed
out, in the case of gene symbols this is going to sometimes look bad
because gene symbols are really HORRIBLE as identifiers.  Gene symbols
are not unique, and are often "correctly" mapped onto several very
different genes as a result.
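
To make the difference concrete, here is a minimal sketch in base R. It
does not use org.Hs.eg.db at all: the data frame, the gene IDs and the
symbols are all made up for illustration, and unique()/split() only stand
in for what the mapped-keys functions and mget() do on a real map.

```r
# Toy symbol <-> gene ID mapping; IDs and symbols are invented for illustration.
map <- data.frame(
  gene_id = c("101", "102", "103", "104"),
  symbol  = c("AAA", "BBB", "CCC", "CCC"),  # "CCC" is shared by two gene IDs
  stringsAsFactors = FALSE
)

# Table view: one row per (gene_id, symbol) pair, analogous to toTable().
n_rows <- nrow(map)

# Key views: each key counted once, analogous to mappedLkeys() / mappedRkeys().
n_ids     <- length(unique(map$gene_id))
n_symbols <- length(unique(map$symbol))

# List view keyed by symbol, similar in shape to the list mget() returns.
by_symbol <- split(map$gene_id, map$symbol)

# Symbols that point at more than one gene ID.
ambiguous <- names(by_symbol)[lengths(by_symbol) > 1]

print(c(rows = n_rows, ids = n_ids, symbols = n_symbols))
print(ambiguous)
```

The row count matches the number of gene IDs but not the number of
symbols, which is the same pattern as in the numbers Daren posted.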

So for example, should CHD5 belong to "chromodomain helicase DNA binding
protein 5" or to "Coronary heart disease, susceptibility to, 5"?

The scientific community still has not resolved all of these
"conflicts".  And so we are stuck with this problem.

So for best results, use a real identifier such as an Entrez Gene ID
when tracking genes.

  Marc

Christof Winter wrote:
> Daren Tan wrote, On 04.04.2009 06:06:
>> I am using two approaches to get EntrezID to genes mapping, as well as
>> genes to EntrezID mappings. toTable gives same number of mappings in
>> both directions, but mget doesn't. Which approach should I trust and
>> why ?
>>
>>> dim(toTable(org.Hs.egSYMBOL2EG))
>> [1] 39824     2
>>> dim(toTable(org.Hs.egSYMBOL))
>> [1] 39824     2
>>
>>> length(mget(mappedRkeys(org.Hs.egSYMBOL2EG), org.Hs.egSYMBOL2EG))
>> [1] 39800
>>> length(mget(mappedLkeys(org.Hs.egSYMBOL), org.Hs.egSYMBOL))
>> [1] 39824
>
> Dear Daren:
>
> It seems that for some gene symbols, more than one Entrez Gene ID is
> mapped to the same symbol:
>
> > x = mget(mappedRkeys(org.Hs.egSYMBOL2EG), org.Hs.egSYMBOL2EG)
> > sum(listLen(x) > 1)
> [1] 24
>
> If you really care about the correct number, you could look up those
> Entrez Gene IDs at NCBI and decide in each case how to count it:
>
> > x[listLen(x) > 1]
>
> HTH,
> Christof
>