[BioC] unable to find known entrezgene with biomaRt

James MacDonald jmacdon at med.umich.edu
Sat Jan 19 20:14:05 CET 2008


Hi Dick,

What information are you starting with? Do you just need the gene symbol 
and description?

If you have the Entrez Gene ID it is really simple.

 > library(org.Hs.eg.db)
 > get("3514", org.Hs.egSYMBOL)
[1] "IGKC"
 > get("3514", org.Hs.egGENENAME)
[1] "immunoglobulin kappa constant"

If you have multiple IDs, then of course you need to use mget() and then 
wrangle the resulting lists into whatever shape you need. An alternative 
with the sweet new SQLite db format (thanks to the friendly folks in 
Seattle) is to dump everything out and then subset from there.

 > ids <- ls(org.Hs.egSYMBOL)[1:10] ##some random IDs
 > thesymbs <- toTable(org.Hs.egSYMBOL) ##dump
 > thesymbs[thesymbs[,1] %in% ids,]
    gene_id   symbol
1        1     A1BG
2        2      A2M
3        9     NAT1
4       10     NAT2
5       12 SERPINA3
6       13    AADAC
7       14     AAMP
8       15    AANAT
9       16     AARS
10      18     ABAT
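
For the mget() route, the list wrangling can look something like the sketch below. To keep it runnable without the annotation package I am using a plain environment (symEnv, a made-up stand-in for org.Hs.egSYMBOL) and a made-up missing ID; as far as I know the same mget() call works on the real map, but treat that as an assumption and check it in your own session.

```r
# Toy stand-in for org.Hs.egSYMBOL: a plain environment keyed by Entrez Gene ID.
# (Assumption: used only so the sketch runs without org.Hs.eg.db installed.)
symEnv <- new.env()
assign("1", "A1BG", envir = symEnv)
assign("2", "A2M", envir = symEnv)
assign("3514", "IGKC", envir = symEnv)

ids <- c("1", "3514", "999999")   # the last ID is deliberately absent

# ifnotfound = NA keeps mget() from stopping with an error on an unknown ID
res <- mget(ids, envir = symEnv, ifnotfound = NA)

# Flatten the named list into a two-column data.frame
symTab <- data.frame(gene_id = names(res),
                     symbol  = unlist(res, use.names = FALSE),
                     stringsAsFactors = FALSE)
symTab
```

Note that unlist() only lines up with the IDs when every ID maps to exactly one value; a map that can return several values per ID needs a different flattening step.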

If you have the Ensembl ID I would use biomaRt.

 > ## mart as set up in your original message, via useMart()
 > getBM(attributes=c("hgnc_symbol", "description"), filters="ensembl_gene_id",
 +       values="ENSG00000211592", mart=mart, output="list")
$hgnc_symbol
$hgnc_symbol$ENSG00000211592
[1] NA


$description
$description$ENSG00000211592
[1] "Immunoglobulin Kappa light chain C gene segment [Source:IMGT/GENE_DB;Acc:IGKC]"

As noted before, the information from the two sources doesn't always 
agree 100%. It is sorta weird in this case, though: hgnc_symbol comes 
back NA even though the description field from Ensembl _does_ contain 
the gene symbol (Acc:IGKC).
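
If you are stuck with an NA symbol, one stopgap is to pull the accession out of the description string yourself. This is only a sketch: accFromDescription is a made-up helper, it assumes the description ends in a "[Source:...;Acc:...]" tag like the one returned above, and the accession is not guaranteed to be a gene symbol for every source.

```r
# Description string copied from the getBM() output above
desc <- "Immunoglobulin Kappa light chain C gene segment [Source:IMGT/GENE_DB;Acc:IGKC]"

# Hypothetical helper: return whatever follows "Acc:" inside the [Source:...] tag,
# or NA when the description carries no such tag.
accFromDescription <- function(x) {
    pat <- "^.*\\[Source:[^;]*;Acc:([^]]+)\\].*$"
    ifelse(grepl(pat, x), sub(pat, "\\1", x), NA_character_)
}

accFromDescription(desc)
accFromDescription("no tag here")
```

The first call should give back "IGKC" and the second NA, so anything pulled out this way is easy to check against Entrez Gene before you rely on it.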

Anyway I hope that helps.


Best,

Jim



Dick Beyer wrote:
> Hi Jim,
> 
> Thanks for explaining this to me.  I had assumed that if the gene was in ensembl, then I could get other bits of info such as Entrez Gene ID and such.
> 
> Is there some bioconductor way, similar to biomaRt, to access this Entrez Gene ID?  What I am really using the getBM call for is just to get a gene symbol and a gene description given the Entrez Gene ID.
> 
> Thanks very much,
> Dick  
> *******************************************************************************
> Richard P. Beyer, Ph.D.	University of Washington
> Tel.:(206) 616 7378	Env. & Occ. Health Sci. , Box 354695
> Fax: (206) 685 4696	4225 Roosevelt Way NE, # 100
>  			Seattle, WA 98105-6099
> http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
> http://staff.washington.edu/~dbeyer
> *******************************************************************************
> 
> On Sat, 19 Jan 2008, James W. MacDonald wrote:
> 
>> Hi Dick,
>>
>> I'm not sure I understand your question. When I go to the webpage you 
>> reference, there is AFAICT no mention of this gene being the same as Entrez 
>> Gene 3514 (other than having the same symbol). Nor does Entrez Gene mention 
>> that it is the same as Ensembl Gene ENSG00000211592.
>>
>> A quick look at the location of the gene would imply that it probably is the 
>> same, and not two genes that have the same symbol (which is not unique).
>>
>> Since both the web interface and the programmatic interface agree, this isn't a 
>> matter of inconsistencies between the interfaces, so perhaps the question is 
>> why do Entrez Gene and Ensembl not reference each other?
>>
>> If so, this I think is simply due to the fact that you have two different 
>> groups that are doing the annotation, and they are not always perfect at 
>> referencing each other.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> Dick Beyer wrote:
>>> Hello,
>>>
>>> I am unable to find some Entrez Gene IDs in the ensembl homo sapiens 
>>> database via biomaRt, even though I can access them via the ensembl web.
>>>
>>> library(biomaRt)
>>> mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl")
>>>
>>> getBM(attributes=c("entrezgene","hgnc_symbol","ensembl_gene_id"),filters="entrezgene",values=3845, 
>>> mart=mart)
>>>    entrezgene hgnc_symbol ensembl_gene_id
>>> 1       3845        KRAS ENSG00000133703
>>>
>>> getBM(attributes=c("entrezgene","hgnc_symbol","ensembl_gene_id"),filters="entrezgene",values=3514, 
>>> mart=mart)
>>> NULL
>>>
>>> The ensembl web interface:
>>>
>>> http://www.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000211592
>>>
>>> shows Entrez Gene ID 3514 corresponds to ensembl_gene_id ENSG00000211592, 
>>> IGKC.
>>>
>>> I'm curious why my biomaRt session will return good results for some valid 
>>> Entrez Gene IDs but not for others.  I'm not sure what to try next.  I'd 
>>> very much appreciate any help.
>>>
>>> sessionInfo()
>>> R version 2.6.1 (2007-11-26)
>>> x86_64-redhat-linux-gnu
>>>
>>> locale:
>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] tools     stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>>   [1] topGO_1.4.0         SparseM_0.75        AnnotationDbi_1.0.6
>>>   [4] RSQLite_0.6-4       DBI_0.2-4           GO_2.0.1
>>>   [7] Biobase_1.16.2      graph_1.16.1        biomaRt_1.12.2
>>> [10] RCurl_0.8-3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] cluster_1.11.9  rcompgen_0.1-17 XML_1.93-2
>>>
>>> Thanks much,
>>> Dick
>>> *******************************************************************************
>>> Richard P. Beyer, Ph.D.	University of Washington
>>> Tel.:(206) 616 7378	Env. & Occ. Health Sci. , Box 354695
>>> Fax: (206) 685 4696	4225 Roosevelt Way NE, # 100
>>>  			Seattle, WA 98105-6099
>>> http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
>>> http://staff.washington.edu/~dbeyer
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 

-- 
James W. MacDonald, MS
Biostatistician
UMCCC cDNA and Affymetrix Core
University of Michigan
1500 E Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623


