[BioC] org.Mm.eg.db gives wrong symbol for MT genes

Marc Carlson mcarlson at fhcrc.org
Tue Aug 13 20:00:04 CEST 2013


As usual, the two of you have sorted things out pretty precisely.

Vince is exactly right about what happened, and Gordon found out exactly 
why when he noticed that NCBI is deliberately renaming all mitochondrial 
symbols (when they can).

I can't say that I necessarily agree with NCBI's decisions here, but if I 
changed these annotations to better match our current expectations, then 
someone else would doubtless wonder why I was altering them away from 
the source material.  So I am afraid that I probably have to leave 
them as they are.
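
For anyone who still wants the MGI symbols to resolve, here is a minimal 
sketch of the kind of special treatment Gordon describes below.  It does 
not touch org.Mm.eg.db at all.  It assumes a small data frame holding a 
few gene_info columns (the two rows are copied from Vince's output), and 
the names gene_info, alias_table and alias_to_geneid are made up for 
illustration only.

```r
# Toy stand-in for a few gene_info columns; values copied from the thread.
gene_info <- data.frame(
  GeneID   = c("11287", "17710"),
  Symbol   = c("Pzp", "COX3"),
  Synonyms = c("A1m|A2m|AI893533|MAM", "-"),
  Symbol_from_nomenclature_authority = c("Pzp", "mt-Co3"),
  stringsAsFactors = FALSE
)

# Genes whose NCBI Symbol differs from the nomenclature-authority symbol
# (the same comparison as xsn[xs != xsn]); "-" means no value in gene_info.
xs  <- gene_info$Symbol
xsn <- gene_info$Symbol_from_nomenclature_authority
differs <- xs != xsn & xsn != "-"

# Build an alias -> GeneID table from Symbol and the "|"-separated Synonyms.
split_synonyms <- strsplit(gene_info$Synonyms, "|", fixed = TRUE)
alias_table <- data.frame(
  alias  = c(gene_info$Symbol, unlist(split_synonyms)),
  GeneID = c(gene_info$GeneID,
             rep(gene_info$GeneID, lengths(split_synonyms))),
  stringsAsFactors = FALSE
)
alias_table <- alias_table[alias_table$alias != "-", ]

# Add the authority symbol as an extra alias wherever it differs.
extra <- data.frame(
  alias  = xsn[differs],
  GeneID = gene_info$GeneID[differs],
  stringsAsFactors = FALSE
)
alias_table <- unique(rbind(alias_table, extra))

# Hypothetical lookup helper: alias in, Entrez Gene ID(s) out.
alias_to_geneid <- function(alias) {
  alias_table$GeneID[alias_table$alias == alias]
}

alias_to_geneid("mt-Co3")   # "17710"
alias_to_geneid("COX3")     # "17710"
```

The official symbol stays whatever NCBI says it is; the MGI symbol is 
only added on the alias side, so nothing drifts away from the source.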


   Marc



On 08/10/2013 07:20 PM, Gordon K Smyth wrote:
> Hi Vincent,
>
> Thanks, that explains it.  After reading your reply, I went to the 
> NCBI Gene FAQ and found the following explanation:
>
> "NOTE: To the greatest extent possible, each protein-coding gene in 
> mitochondria has been assigned the same name (symbol) and full 
> description across species. In some instances, this is at variance 
> with the symbol assigned by species-specific nomenclature committees."
>
> This would be fine except that (i) the NCBI Gene web interface 
> disagrees with the NCBI gene_info file and (ii) the nomenclature 
> committee symbol from MGI has not been included as a synonym in the 
> gene_info file.
>
> Anyway, the bottom line for my lab is that we will treat the 
> gene_info/org.Mm.eg.db symbols as official, and we will have to give 
> the MT genes special treatment when mapping aliases.
>
> Regards
> Gordon
>
> On Sat, 10 Aug 2013, Vincent Carey wrote:
>
>> Gordon, more definitive answers will likely come from the annotation core
>> members, but here is what I understand about this.  The mappings are
>> completely dependent on NCBI content.
>>
>> Working with
>>
>> ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Mus_musculus.gene_info.gz 
>>
>>
>> the header is
>>
>> #Format: tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome
>> map_location description type_of_gene Symbol_from_nomenclature_authority
>> Full_name_from_nomenclature_authority Nomenclature_status
>> Other_designations Modification_date (tab is used as a separator, pound
>> sign - start of a comment)
>>
>> and, with some context, the record for 17710 is
>>
>> > x[c(1,3516),]
>>      tax_id GeneID Symbol LocusTag             Synonyms
>> 1     10090  11287    Pzp        - A1m|A2m|AI893533|MAM
>> 3516  10090  17710   COX3        -                    -
>>                                                           dbXrefs chromosome
>> 1    MGI:87854|Ensembl:ENSMUSG00000030359|Vega:OTTMUSG00000022212          6
>> 3516                                                   MGI:102502         MT
>>            map_location                      description   type_of_gene
>> 1    6 F1-G3|6 63.02 cM           pregnancy zone protein protein-coding
>> 3516                  - cytochrome c oxidase subunit III protein-coding
>>      Symbol_from_nomenclature_authority
>> 1                                   Pzp
>> 3516                             mt-Co3
>>        Full_name_from_nomenclature_authority Nomenclature_status
>> 1                     pregnancy zone protein                   O
>> 3516 cytochrome c oxidase III, mitochondrial                   O
>>                                         Other_designations Modification_date  X
>> 1    alpha 1 macroglobulin|alpha-2-M|alpha-2-macroglobulin          20130804 NA
>> 3516                                                     -          20130804 NA
>>
>> I would conjecture that the solution needs to come from NCBI -- they may
>> have neglected to deal properly with the MT genes in this case, as the
>> following computation suggests.  The symbols for which field "Symbol" does
>> not agree with field "Symbol_from_nomenclature_authority" are
>>
>> > xsn[xs!=xsn]
>>  [1] "mt-Atp6" "mt-Atp8" "mt-Co1"  "mt-Co2"  "mt-Co3"  "mt-Cytb" "mt-Nd1"
>>  [8] "mt-Nd2"  "mt-Nd3"  "mt-Nd4"  "mt-Nd4l" "mt-Nd5"  "mt-Nd6"  "mt-Rnr1"
>> [15] "mt-Rnr2" "mt-Ta"   "mt-Tc"   "mt-Td"   "mt-Te"   "mt-Tf"   "mt-Tg"
>> [22] "mt-Th"   "mt-Ti"   "mt-Tk"   "mt-Tl1"  "mt-Tl2"  "mt-Tm"   "mt-Tn"
>> [29] "mt-Tp"   "mt-Tq"   "mt-Tr"   "mt-Ts1"  "mt-Ts2"  "mt-Tt"   "mt-Tv"
>> [36] "mt-Tw"   "mt-Ty"
>>
>>
>> On Fri, Aug 9, 2013 at 11:17 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>
>>> Dear Biocore,
>>>
>>> We make a strong effort to use current NCBI official gene symbols and
>>> names in all our work, and we make much use of the excellent Bioconductor
>>> packages org.Mm.eg.db and org.Hs.eg.db for this purpose.
>>>
>>> I have recently noticed that org.Mm.eg.db is giving incorrect official
>>> names for mitochondrial genes.  It is giving human symbols for these genes
>>> instead of mouse symbols.  For example
>>>
>>>  > mappedRkeys(org.Mm.egSYMBOL["17710"])
>>>   [1] "COX3"
>>>
>>> According to both Entrez Gene
>>>
>>> http://www.ncbi.nlm.nih.gov/gene/?term=17710
>>>
>>> and MGI
>>>
>>> http://www.informatics.jax.org/marker/MGI:102502
>>>
>>> the official symbol is "mt-Co3".  This has been the official symbol for at
>>> least 4 years and probably longer.
>>>
>>> The correct name is not even included as an Alias:
>>>
>>>  > mappedRkeys(revmap(org.Mm.egALIAS2EG)["17710"])
>>>   [1] "COX3"
>>>
>>> COX3 is actually the symbol for the human ortholog.  It should only be
>>> an alias for the mouse gene.
>>>
>>> Same for all the mitochondrial genes.  In all cases, org.Mm.egSYMBOL is
>>> giving the human symbol instead of the mouse symbol.
>>>
>>> Is this deliberate?  If not, can you please fix?
>>>
>>> Thanks a lot
>>> Gordon
>>>
>>> ---------------------------------------------
>>> Professor Gordon K Smyth,
>>> Bioinformatics Division,
>>> Walter and Eliza Hall Institute of Medical Research,
>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>> http://www.statsci.org/smyth
>>>
>>>
>>> > sessionInfo()
>>> R version 3.0.1 Patched (2013-07-04 r63183)
>>> Platform: i386-w64-mingw32/i386 (32-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_Australia.1252
>>> [2] LC_CTYPE=English_Australia.1252
>>> [3] LC_MONETARY=English_Australia.1252
>>> [4] LC_NUMERIC=C
>>> [5] LC_TIME=English_Australia.1252
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>> [7] methods   base
>>>
>>> other attached packages:
>>> [1] org.Mm.eg.db_2.9.0   org.Hs.eg.db_2.9.0   RSQLite_0.11.4
>>> [4] DBI_0.2-7            AnnotationDbi_1.22.6 Biobase_2.20.0
>>> [7] BiocGenerics_0.6.0   limma_3.17.20
>>>
>>> loaded via a namespace (and not attached):
>>> [1] IRanges_1.18.2 stats4_3.0.1
>


