[BioC] R: is there an identifier that uniquely identifies a gene all over the many databases ?

Simon Anders anders at ebi.ac.uk
Mon Jul 13 12:59:15 CEST 2009


Hi

mauede at alice.it wrote:
> I forgot to specify that I am only dealing  with Human species.
> I used the ENSGxxxxx identifier to get out some data that I hoped would 
> uniquely identify the gene.
> 
>  > gene.map <- getBM(attributes=c("hgnc_symbol","external_gene_id","refseq_dna"),
>                      filters="ensembl_gene_id", values="ENSG00000206557", mart=hmart)
>  > show(gene.map)
> 
> As long as all Human genes are uniquely identified through their 
> respective "hgnc_symbol" I am fine.
> 
> Why should I use the other identifier you mention ENSTxxxx ?

Well, I mentioned them because you talked about genes and transcripts as 
if these two were interchangeable.

If you use Ensembl's Biomart you will usually get one data record for each 
transcript, not one for each gene. Take, for example, the gene GLB1 
(ENSG00000170266).

It has three transcripts:
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000170266;r=3:32621636-33121635;t=ENST00000307363

The first transcript (ENST00000307377) has a different 3' UTR from the second 
and third (ENST00000307363 and ENST00000399402).

As Steven wrote, you should add "ensembl_transcript_id" to your list of 
attributes to see what is going on.
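
A minimal sketch of that suggestion, assuming the biomaRt package is 
installed. The helper name transcripts_for_gene is made up for 
illustration, and the attribute names are the ones used in this thread, 
so they may differ in other Ensembl releases:

```r
# Sketch: list the transcripts that Ensembl BioMart reports for one gene.
# `transcripts_for_gene` is a hypothetical helper, not part of biomaRt.
transcripts_for_gene <- function(gene_id, mart) {
  biomaRt::getBM(
    # Asking for the transcript ID makes the per-transcript records visible.
    attributes = c("ensembl_gene_id", "ensembl_transcript_id", "hgnc_symbol"),
    filters    = "ensembl_gene_id",
    values     = gene_id,
    mart       = mart
  )
}

# Usage (needs network access; the dataset name is an assumption):
# hmart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# glb1  <- transcripts_for_gene("ENSG00000170266", hmart)
# nrow(glb1)                # typically one row per transcript
# unique(glb1$hgnc_symbol)  # the gene symbol repeats across those rows
```

If the result has several rows with the same "hgnc_symbol" but different 
"ensembl_transcript_id" values, the extra rows come from the transcripts, 
not from several genes.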

Personally, I also find it very helpful to first try out any Biomart 
query on the web interface
http://www.ensembl.org/biomart/martview
before going to R. There, you can see quite easily what is going on.

Cheers
   Simon


