[BioC] annotation - biomaRt - getBM - multiple entrez ID for one ensembl ID

Laure Cougnaud laure.cougnaud at openanalytics.eu
Fri May 3 16:18:22 CEST 2013


Hi James,

Thanks for your response. Indeed, this doesn't seem to be an issue with the getBM function, but rather with the mapping between the Ensembl ID and the Entrez IDs.

In my case, I have data from an exon array, so after running RMA with the CDF file 'huex10stv2hsensg', I have one value per ENSG: the summarized value of all probes targeting that region.

I understand that several Entrez IDs map within the location of ENSG00000215417 (Chromosome 13: 92,000,074-92,006,833), but in my case I am interested only in the gene ID corresponding to MIR17HG (407975), because it is the only ID that maps entirely to the gene location of ENSG00000215417.
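
A minimal sketch of one way to make that selection in base R. It does not check genomic coordinates; it swaps in a simpler heuristic, keeping the Entrez ID whose own symbol equals the symbol reported for the Ensembl gene. The object names (`bm`, `entrez_symbol`, `bm_matched`) are hypothetical, and the values are copied from the output quoted in this thread rather than fetched from biomaRt or org.Hs.eg.db.

```r
# Rows as returned by getBM for ENSG00000215417 (values copied from the thread).
bm <- data.frame(
  ensembl_gene_id = "ENSG00000215417",
  entrezgene = c(406952L, 406953L, 406979L, 406980L, 406982L, 407048L, 407975L),
  hgnc_symbol = "MIR17HG",
  stringsAsFactors = FALSE
)

# Entrez ID -> symbol lookup, as reported by org.Hs.egSYMBOL in the thread.
# In practice this vector would be built from the annotation package.
entrez_symbol <- c(
  "407975" = "MIR17HG", "406952" = "MIR17", "406953" = "MIR18A",
  "406979" = "MIR19A", "406980" = "MIR19B1", "406982" = "MIR20A",
  "407048" = "MIR92A1"
)

# Symbol belonging to each Entrez ID in the getBM result.
own_symbol <- entrez_symbol[as.character(bm$entrezgene)]

# Keep only rows where the Entrez gene's own symbol equals the symbol
# that is reported for the Ensembl gene (assumption: symbols agree).
bm_matched <- bm[!is.na(own_symbol) & own_symbol == bm$hgnc_symbol, ]
print(bm_matched)
```

This heuristic only works when the two databases use the same symbol for the gene, so Ensembl IDs left with zero rows or more than one row would still need a manual check.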

Also, I was a bit confused by the fact that getBM returns several gene IDs but only one symbol (which seems to be the right one for the ENSG), whereas the Entrez IDs correspond to different gene symbols, as you pointed out.
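
A small base R sketch of that pattern, under the assumption that the result has one row per combination of requested attributes, so the single symbol of the Ensembl gene is repeated on every Entrez row. The data frame `bm` and the counters `n_entrez` and `n_symbol` are hypothetical names, and the rows are copied from two of the genes in the output quoted below.

```r
# Two Ensembl genes from the thread, with the Entrez IDs listed for each.
bm <- data.frame(
  ensembl_gene_id = rep(c("ENSG00000215417", "ENSG00000224078"), times = c(7, 6)),
  entrezgene = c(406952L, 406953L, 406979L, 406980L, 406982L, 407048L, 407975L,
                 91380L, 100033444L, 100033450L, 100033802L, 100033820L, 100506948L),
  hgnc_symbol = rep(c("MIR17HG", "SNHG14"), times = c(7, 6)),
  stringsAsFactors = FALSE
)

# Count distinct Entrez IDs and distinct symbols per Ensembl gene.
count_unique <- function(x) length(unique(x))
n_entrez <- tapply(bm$entrezgene, bm$ensembl_gene_id, count_unique)
n_symbol <- tapply(bm$hgnc_symbol, bm$ensembl_gene_id, count_unique)

# Ensembl genes with more than one Entrez ID are the one-to-many cases.
print(cbind(n_entrez, n_symbol))
print(names(n_entrez)[n_entrez > 1])
```

Running the same count on a full getBM result would list every Ensembl ID that needs a decision about which Entrez ID to keep.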

Best,

Laure


----- Original Message -----
From: "James W. MacDonald" <jmacdon at uw.edu>
To: "Laure Cougnaud [guest]" <guest at bioconductor.org>
Cc: bioconductor at r-project.org, "laure cougnaud" <laure.cougnaud at openanalytics.eu>
Sent: Friday, May 3, 2013 3:43:19 PM
Subject: Re: [BioC] annotation - biomaRt - getBM - multiple entrez ID for one ensembl ID

Hi Laure,

On 5/3/2013 2:43 AM, Laure Cougnaud [guest] wrote:
> Hello,
>
> I am currently analyzing data from an exon array. After pre-processing with RMA, which gives me an eSet with Ensembl IDs, I would like to annotate the genes with Entrez IDs. I am using the getBM function with the Ensembl gene ID as input and the Entrez gene ID as output. Here is part of the code I am using:
> mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
> gene2genomeEx <- getBM(values = ex, filters = "ensembl_gene_id", mart = mart,
>     attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol", "external_gene_id",
>                    "external_gene_db", "description", "chromosome_name", "strand"))
> However, for several genes (many of them histone genes), I obtain several Entrez IDs for the same Ensembl ID. For example, for:
> ex <- c("ENSG00000215417", "ENSG00000224078", "ENSG00000198366", "ENSG00000196176", "ENSG00000166012", "ENSG00000158406", "ENSG00000196787")
> I obtain:
>     ensembl_gene_id entrezgene hgnc_symbol external_gene_id external_gene_db
> 1  ENSG00000158406       8294    HIST1H4H         HIST1H4H      HGNC Symbol
> 2  ENSG00000158406       8359    HIST1H4H         HIST1H4H      HGNC Symbol
> 3  ENSG00000158406       8360    HIST1H4H         HIST1H4H      HGNC Symbol
> 4  ENSG00000158406       8361    HIST1H4H         HIST1H4H      HGNC Symbol
> 5  ENSG00000158406       8362    HIST1H4H         HIST1H4H      HGNC Symbol
> 6  ENSG00000158406       8363    HIST1H4H         HIST1H4H      HGNC Symbol
> 7  ENSG00000158406       8364    HIST1H4H         HIST1H4H      HGNC Symbol
> 8  ENSG00000158406       8365    HIST1H4H         HIST1H4H      HGNC Symbol
> 9  ENSG00000158406       8366    HIST1H4H         HIST1H4H      HGNC Symbol
> 10 ENSG00000158406       8367    HIST1H4H         HIST1H4H      HGNC Symbol
> 11 ENSG00000158406       8368    HIST1H4H         HIST1H4H      HGNC Symbol
> 12 ENSG00000158406       8370    HIST1H4H         HIST1H4H      HGNC Symbol
> 13 ENSG00000158406     121504    HIST1H4H         HIST1H4H      HGNC Symbol
> 14 ENSG00000158406     554313    HIST1H4H         HIST1H4H      HGNC Symbol
> 15 ENSG00000166012      79101       TAF1D            TAF1D      HGNC Symbol
> 16 ENSG00000166012     654320       TAF1D            TAF1D      HGNC Symbol
> 17 ENSG00000166012     677792       TAF1D            TAF1D      HGNC Symbol
> 18 ENSG00000166012     677805       TAF1D            TAF1D      HGNC Symbol
> 19 ENSG00000166012     677822       TAF1D            TAF1D      HGNC Symbol
> 20 ENSG00000166012     692063       TAF1D            TAF1D      HGNC Symbol
> 21 ENSG00000166012     692072       TAF1D            TAF1D      HGNC Symbol
> 22 ENSG00000166012  100302240       TAF1D            TAF1D      HGNC Symbol
> 23 ENSG00000196176       8294    HIST1H4A         HIST1H4A      HGNC Symbol
> 24 ENSG00000196176       8359    HIST1H4A         HIST1H4A      HGNC Symbol
> 25 ENSG00000196176       8360    HIST1H4A         HIST1H4A      HGNC Symbol
> 26 ENSG00000196176       8361    HIST1H4A         HIST1H4A      HGNC Symbol
> 27 ENSG00000196176       8362    HIST1H4A         HIST1H4A      HGNC Symbol
> 28 ENSG00000196176       8363    HIST1H4A         HIST1H4A      HGNC Symbol
> 29 ENSG00000196176       8364    HIST1H4A         HIST1H4A      HGNC Symbol
> 30 ENSG00000196176       8365    HIST1H4A         HIST1H4A      HGNC Symbol
> 31 ENSG00000196176       8366    HIST1H4A         HIST1H4A      HGNC Symbol
> 32 ENSG00000196176       8367    HIST1H4A         HIST1H4A      HGNC Symbol
> 33 ENSG00000196176       8368    HIST1H4A         HIST1H4A      HGNC Symbol
> 34 ENSG00000196176       8370    HIST1H4A         HIST1H4A      HGNC Symbol
> 35 ENSG00000196176     121504    HIST1H4A         HIST1H4A      HGNC Symbol
> 36 ENSG00000196176     554313    HIST1H4A         HIST1H4A      HGNC Symbol
> 37 ENSG00000196787       8329   HIST1H2AG        HIST1H2AG      HGNC Symbol
> 38 ENSG00000196787       8330   HIST1H2AG        HIST1H2AG      HGNC Symbol
> 39 ENSG00000196787       8332   HIST1H2AG        HIST1H2AG      HGNC Symbol
> 40 ENSG00000196787       8336   HIST1H2AG        HIST1H2AG      HGNC Symbol
> 41 ENSG00000196787       8969   HIST1H2AG        HIST1H2AG      HGNC Symbol
> 42 ENSG00000196787      85235   HIST1H2AG        HIST1H2AG      HGNC Symbol
> 43 ENSG00000198366       8350    HIST1H3A         HIST1H3A      HGNC Symbol
> 44 ENSG00000198366       8351    HIST1H3A         HIST1H3A      HGNC Symbol
> 45 ENSG00000198366       8352    HIST1H3A         HIST1H3A      HGNC Symbol
> 46 ENSG00000198366       8353    HIST1H3A         HIST1H3A      HGNC Symbol
> 47 ENSG00000198366       8354    HIST1H3A         HIST1H3A      HGNC Symbol
> 48 ENSG00000198366       8355    HIST1H3A         HIST1H3A      HGNC Symbol
> 49 ENSG00000198366       8356    HIST1H3A         HIST1H3A      HGNC Symbol
> 50 ENSG00000198366       8357    HIST1H3A         HIST1H3A      HGNC Symbol
> 51 ENSG00000198366       8358    HIST1H3A         HIST1H3A      HGNC Symbol
> 52 ENSG00000198366       8968    HIST1H3A         HIST1H3A      HGNC Symbol
> 53 ENSG00000215417     406952     MIR17HG          MIR17HG      HGNC Symbol
> 54 ENSG00000215417     406953     MIR17HG          MIR17HG      HGNC Symbol
> 55 ENSG00000215417     406979     MIR17HG          MIR17HG      HGNC Symbol
> 56 ENSG00000215417     406980     MIR17HG          MIR17HG      HGNC Symbol
> 57 ENSG00000215417     406982     MIR17HG          MIR17HG      HGNC Symbol
> 58 ENSG00000215417     407048     MIR17HG          MIR17HG      HGNC Symbol
> 59 ENSG00000215417     407975     MIR17HG          MIR17HG      HGNC Symbol
> 60 ENSG00000224078      91380      SNHG14           SNHG14      HGNC Symbol
> 61 ENSG00000224078  100033444      SNHG14           SNHG14      HGNC Symbol
> 62 ENSG00000224078  100033450      SNHG14           SNHG14      HGNC Symbol
> 63 ENSG00000224078  100033802      SNHG14           SNHG14      HGNC Symbol
> 64 ENSG00000224078  100033820      SNHG14           SNHG14      HGNC Symbol
> 65 ENSG00000224078  100506948      SNHG14           SNHG14      HGNC Symbol
> The description, chromosome_name and strand are the same for each Ensembl gene ID.
> I manually checked on ensembl.org which Entrez ID corresponds to each Ensembl ID, and I found only one Entrez ID for each gene. Does anyone know where this problem comes from? Is it linked to the nature of my request?

I'm not sure where you are looking, but as an example, for 
ENSG00000215417, I see 7 EntrezGene genes on the Ensembl site, just like 
you have here:

http://www.ensembl.org/Homo_sapiens/Gene/Matches?g=ENSG00000215417;r=13:92000074-92006833

In addition:

 > mget(get(ex[1], revmap(org.Hs.egENSEMBL)), org.Hs.egSYMBOL)
$`407975`
[1] "MIR17HG"

$`406952`
[1] "MIR17"

$`406953`
[1] "MIR18A"

$`406979`
[1] "MIR19A"

$`406980`
[1] "MIR19B1"

$`406982`
[1] "MIR20A"

$`407048`
[1] "MIR92A1"

So I don't see anything unexpected here.

Best,

Jim
>
> Thanks in advance for your help,
>
> Yours sincerely,
>
> Laure Cougnaud
>
>
>   -- output of sessionInfo():
>
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
>   [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] biomaRt_2.12.0     affy_1.34.0        Biobase_2.16.0     BiocGenerics_0.2.0 rj_1.1.0-4
>
> loaded via a namespace (and not attached):
> [1] affyio_1.24.0         BiocInstaller_1.4.7   preprocessCore_1.18.0 RCurl_1.91-1          rj.gd_1.1.0-1         tools_2.15.1
> [7] XML_3.9-4             zlibbioc_1.2.0
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099


