[BioC] from RefSeq to GO terms / gene symbol to geneID

Simon Lin simonlin at duke.edu
Tue Jun 12 22:39:27 CEST 2007

In the following two unrelated messages, both Sean and Nianhua suggested
downloading and parsing data tables from NCBI. The gene_info table and
several others seem very useful. If that is the case, why not pre-load
them into an SQLite database and distribute it as part of the annotation
package for human?

Simon

=================

Date: Tue, 12 Jun 2007 05:59:55 -0400
From: Sean Davis <sdavis2 at mail.nih.gov>
Subject: Re: [BioC] from RefSeq GI protein identifiers to GO terms
To: Lina Hultin-Rosenberg <lina.hultin-rosenberg at ki.se>
Cc: bioconductor at stat.math.ethz.ch
Message-ID: <466E6E9B.3020609 at mail.nih.gov>
Content-Type: text/plain; charset=ISO-8859-1

Lina Hultin-Rosenberg wrote:

>> Dear list,
>> This might be a question that has been discussed previously, but I could not
>> find any good solution for it. I have lists of human proteins from various
>> proteomics studies that I want to compare with regard to the GO terms
>> associated with them. I have the RefSeq GI protein ID for the proteins, and my
>> question is how best to map those to other identifiers that I can use in
>> subsequent GO analysis.
>> It might be that this problem is best solved outside R, but maybe someone
>> can still give me a hint towards the best solution. For me this is a problem that
>> comes up quite often - the need to map between different identifiers - and I
>> have not yet found any really good solution to it. If I use IPI, for example, I
>> always lose some proteins/genes since the coverage is rather bad, but maybe
>> there is no solution that will give a perfect mapping?!

The file located here:


and described in detail here:


maps RefSeq to Entrez Gene ID.  Once you have the Entrez Gene ID, you
can use the Bioconductor annotation packages to get GO mappings.  The
file above is a tab-delimited text file, so you should be able to read
it into R and do the matching by GI number rather easily.
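
As a rough illustration of the matching step (not Sean's own code, and in
Python rather than R so that it runs without any Bioconductor packages),
the sketch below reads a small tab-delimited table and looks up Entrez
Gene IDs by protein GI. The column names tax_id, GeneID and protein_gi,
the sample rows, and the helper build_gi_to_gene are all assumptions made
for this example; check them against the real file's header before use.

```python
import csv
import io

# Stand-in for the tab-delimited mapping file. The column names and the
# rows are invented for illustration (assumption, not the real file).
SAMPLE = (
    "tax_id\tGeneID\tprotein_gi\n"
    "9606\t1001\t111111\n"
    "9606\t1002\t222222\n"
    "10090\t2001\t333333\n"
)


def build_gi_to_gene(handle, tax_id="9606"):
    """Return a dict mapping protein GI -> set of Entrez Gene IDs.

    Hypothetical helper: keeps only rows for the given taxonomy ID
    (9606 is human) and treats every field as a plain string.
    """
    mapping = {}
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["tax_id"] != tax_id:
            continue
        mapping.setdefault(row["protein_gi"], set()).add(row["GeneID"])
    return mapping


gi_to_gene = build_gi_to_gene(io.StringIO(SAMPLE))

# GI numbers from a (made-up) proteomics result list.
my_gis = ["111111", "999999"]
matched = {gi: sorted(gi_to_gene[gi]) for gi in my_gis if gi in gi_to_gene}
unmatched = [gi for gi in my_gis if gi not in gi_to_gene]

print(matched)
print(unmatched)
```

For the real file you would pass an open file handle instead of the
in-memory sample. GIs that end up in unmatched are the ones with no row
for the chosen organism.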

Hope that helps.


Message: 4
Date: Mon, 11 Jun 2007 12:36:31 +0000 (UTC)
From: Nianhua Li <nialicn at yahoo.com>
Subject: Re: [BioC] getting Locus Link ids from gene symbol
To: bioconductor at stat.math.ethz.ch
Message-ID: <loom.20070611T142932-100 at post.gmane.org>
Content-Type: text/plain; charset=us-ascii

Hi, Alex,

You can parse ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
There are 4 useful columns: tax_id (column 1), GeneID (column 2), Symbol 
(column 3), and Synonyms (column 5). You can:

1 Read in the file
2 filter it based on tax_id
3 match your gene symbols to the "Symbol" column and find their Gene IDs
4 remove the matched gene symbols from your list
5 match the rest of the gene symbols to the "Synonyms" column and find their
Gene IDs
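
The steps above could be sketched roughly as follows. This is only an
illustration in Python, not a tested recipe for the real gene_info.gz:
the sample rows are invented, the function symbols_to_gene_ids is a
hypothetical helper, and it assumes that synonyms are separated by "|"
with "-" meaning "none" and that header lines start with "#". Verify
those assumptions against the actual file.

```python
import csv
import io

# Tiny stand-in for gene_info: tab-delimited, with columns 1, 2, 3 and 5
# holding tax_id, GeneID, Symbol and Synonyms as described above.
# The rows are invented for illustration.
SAMPLE = (
    "9606\t101\tABC1\t-\tFOO|BAR\n"
    "9606\t102\tXYZ2\t-\t-\n"
    "10090\t201\tAbc1\t-\tFoo\n"
)


def symbols_to_gene_ids(handle, symbols, tax_id="9606"):
    """Map gene symbols to Gene IDs, trying Symbol first, then Synonyms."""
    by_symbol, by_synonym = {}, {}
    # Steps 1 and 2: read the file and keep rows for one tax_id.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0].startswith("#") or row[0] != tax_id:
            continue
        gene_id, symbol, synonyms = row[1], row[2], row[4]
        by_symbol.setdefault(symbol, set()).add(gene_id)
        # Assumption: "-" marks an empty field, "|" separates synonyms.
        if synonyms != "-":
            for syn in synonyms.split("|"):
                by_synonym.setdefault(syn, set()).add(gene_id)

    result, remaining = {}, []
    # Steps 3 and 4: match on Symbol, set aside whatever did not match.
    for sym in symbols:
        if sym in by_symbol:
            result[sym] = sorted(by_symbol[sym])
        else:
            remaining.append(sym)
    # Step 5: match the leftovers on Synonyms.
    for sym in remaining:
        if sym in by_synonym:
            result[sym] = sorted(by_synonym[sym])
    return result


query = ["ABC1", "BAR", "NOPE", "Foo"]
result = symbols_to_gene_ids(io.StringIO(SAMPLE), query)
print(result)
```

A synonym can belong to more than one gene, which is why each symbol maps
to a list of Gene IDs rather than a single value.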

hope this helps


Nianhua Li
Software Developer
