[BioC] Genbank to Unigene IDs

Dave Waddell dwaddell at nutecsciences.com
Fri Apr 16 21:53:18 CEST 2004

There are a number of problems with all of the solutions proposed.
1. Flat files like Hs.data are huge and grepping them takes forever.
2. Keeping flat files up to date is a waste of bandwidth.
3. The annotation really needs to be in some kind of database such as
SOURCE, Matchminer, DAVID or whatever with indexes on each field so that
searches can complete in a reasonable period of time.
4. HTML based tools are handy for small searches but useless if you want to
perform searches with a large number of terms where you expect to get back
parseable data.
5. Many Genbank Accession numbers (ESTs in particular) don't map to
Locuslink, so going from Accession number to Locuslink to Unigene
simply doesn't work, e.g. for AA683077.
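
To make points 1 and 3 concrete, here is a rough sketch of the indexing
idea: parse the flat file once, load accession/cluster pairs into an indexed
table, and answer later lookups from the index instead of re-scanning the
file. It uses Python and SQLite purely for illustration. The record layout
(an "ID" line, "SEQUENCE" lines carrying "ACC=...", and "//" closing each
record) is an assumption about the Hs.data format, and the table name,
function names and sample records are all hypothetical.

```python
import re
import sqlite3

# Made-up records in the assumed Hs.data layout (not real UniGene data).
SAMPLE = """\
ID          Hs.900001
TITLE       hypothetical cluster one
SEQUENCE    ACC=XX000001.1; NID=g1; SEQTYPE=mRNA
SEQUENCE    ACC=XX000002.2; NID=g2; SEQTYPE=EST
//
ID          Hs.900002
TITLE       hypothetical cluster two
SEQUENCE    ACC=XX000003.1; NID=g3; SEQTYPE=EST
//
"""

ACC_RE = re.compile(r"ACC=([^;\s]+)")


def parse_pairs(lines):
    """Yield (accession, cluster_id) pairs, dropping the version suffix."""
    cluster = None
    for line in lines:
        if line.startswith("ID "):
            cluster = line.split()[1]
        elif line.startswith("SEQUENCE") and cluster is not None:
            match = ACC_RE.search(line)
            if match:
                yield match.group(1).split(".")[0], cluster
        elif line.startswith("//"):
            cluster = None


def build_index(lines, db_path=":memory:"):
    """Load all pairs into SQLite and index the accession column."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE acc2ug (accession TEXT, unigene TEXT)")
    conn.executemany("INSERT INTO acc2ug VALUES (?, ?)", parse_pairs(lines))
    conn.execute("CREATE INDEX idx_acc ON acc2ug (accession)")
    conn.commit()
    return conn


def lookup(conn, accession):
    """Return every cluster ID recorded for one accession (may be empty)."""
    rows = conn.execute(
        "SELECT unigene FROM acc2ug WHERE accession = ?", (accession,)
    ).fetchall()
    return [row[0] for row in rows]


conn = build_index(SAMPLE.splitlines())
print(lookup(conn, "XX000003"))
```

Passing a file path as db_path instead of ":memory:" would keep the index on
disk, so the expensive parse happens only when the flat file is refreshed.
This does nothing for point 2: the file still has to be downloaded.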

Matchminer works for me because I'm calling Rserve and Matchminer from Java,
the response is relatively quick, and I don't have to worry about keeping
the data current.

-----Original Message-----
From: Gordon Smyth [mailto:smyth at wehi.edu.au] 
Sent: Thursday, April 15, 2004 8:48 PM
To: rossini at u.washington.edu; James MacDonald; Dave Waddell; Jean Yee Hwa
Subject: RE: [BioC] Genbank to Unigene IDs

Dear Jean, Tony, James and Dave,

Many thanks for your very helpful replies. Just to reiterate, my interest 
was to map from GenBank to UniGene IDs within R, i.e., write a function 
that will take a character vector or list of GenBank IDs and will return 
the corresponding vector or list of UniGene IDs.

If one ignores R, the easiest way that I know of to map GenBank to 
UniGene IDs is to download Hs.data.gz, and to grep or otherwise search for 
the GenBank IDs as text strings. (My lab keeps a mirror of the usual 
databases, so downloading isn't actually required if the code is to be used 
within my own lab.)
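
For illustration only, the download-and-search idea could be sketched as a
single pass over the compressed file that maps a whole vector of accessions
at once, rather than one grep per ID. The sketch is in Python rather than R,
and the same logic could be ported. It assumes records consisting of an "ID"
line followed by "SEQUENCE" lines that carry "ACC=..."; the function name,
file name and demo records are hypothetical.

```python
import gzip
import os
import tempfile


def genbank_to_unigene(accessions, path):
    """Return one UniGene cluster ID (or None) per input accession."""
    wanted = {acc.split(".")[0] for acc in accessions}
    found = {}
    cluster = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("ID "):
                cluster = line.split()[1]
            elif line.startswith("SEQUENCE") and "ACC=" in line:
                acc = line.split("ACC=", 1)[1].split(";", 1)[0].strip()
                acc = acc.split(".")[0]  # drop the version suffix
                if acc in wanted:
                    # Keep the first cluster seen for each accession.
                    found.setdefault(acc, cluster)
    return [found.get(acc.split(".")[0]) for acc in accessions]


# Demo on a tiny made-up file (not real UniGene data).
demo = (
    "ID          Hs.900001\n"
    "SEQUENCE    ACC=XX000001.1; NID=g1; SEQTYPE=mRNA\n"
    "//\n"
    "ID          Hs.900002\n"
    "SEQUENCE    ACC=XX000002.1; NID=g2; SEQTYPE=EST\n"
    "//\n"
)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.data.gz")
with gzip.open(demo_path, "wt") as out:
    out.write(demo)

result = genbank_to_unigene(["XX000002", "XX000001", "XX999999"], demo_path)
print(result)
```

The result has the same length and order as the input, with None for any
accession that does not appear in the file.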

As far as R is concerned, you've described a number of methods by which 
the job could be done in principle, but no one has shown actual code to 
answer my example question, "What's the Unigene ID for GB=NM_004551?" Would 
it be a fair statement to say that there isn't a reasonably easy way to do 
the job using Bioconductor, and that I would do better to stick with the 
download and grep idea (which of course could be done within R if need be)?


PS. There seems no way to use AnnBuilder in R 1.9.0 for Windows. Amongst 
other problems, AnnBuilder won't load without the XML package, and that 
package is not available for R 1.9.0 under Windows.