[BioC] Genbank to Unigene IDs

Robert Gentleman rgentlem at jimmy.harvard.edu
Mon Apr 19 20:22:43 CEST 2004

On Fri, Apr 16, 2004 at 02:53:18PM -0500, Dave Waddell wrote:
> There are a number of problems in all of the solutions proposed.
> 1. Flat files like Hs are huge and grepping them takes forever.

 Yes, but I don't think that anyone is doing that for a production
 system (for a one-off query it may in fact be more efficient,
 depending on how you measure efficiency).
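
 For the one-off case, a single pass that builds an in-memory index
 avoids grepping the file once per accession. The following is a
 minimal sketch, not a tested tool: it assumes an Hs.data-style layout
 in which each record has an "ID" line, one "SEQUENCE" line per member
 sequence carrying an "ACC=" field, and a closing "//" line. The
 function name and the sample records are made up for illustration.

```python
import io


def build_accession_index(lines):
    """Map GenBank accession (version suffix stripped) -> UniGene cluster ID.

    Assumes the record layout described above; adjust the prefixes if the
    real file differs.
    """
    index = {}
    cluster_id = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("ID "):
            # Start of a record: remember the cluster ID for later lines.
            cluster_id = line.split()[1]
        elif line.startswith("SEQUENCE") and cluster_id is not None:
            # Fields look like "ACC=...; NID=...; SEQTYPE=..."
            for field in line[len("SEQUENCE"):].split(";"):
                key, _, value = field.strip().partition("=")
                if key == "ACC":
                    index[value.split(".")[0]] = cluster_id
        elif line.startswith("//"):
            # End of record.
            cluster_id = None
    return index


# Synthetic records in the assumed layout; these are NOT real UniGene data.
SAMPLE = """\
ID          Hs.0001
TITLE       made-up cluster one
SEQUENCE    ACC=XX000001.1; NID=g1; SEQTYPE=mRNA
SEQUENCE    ACC=XX000002; NID=g2; SEQTYPE=EST
//
ID          Hs.0002
TITLE       made-up cluster two
SEQUENCE    ACC=XX000003.2; NID=g3; SEQTYPE=EST
//
"""

index = build_accession_index(io.StringIO(SAMPLE))
queries = ["XX000003", "XX000001", "XX999999"]
results = [index.get(q) for q in queries]  # None marks an unmapped accession
print(results)
```

 For a real compressed file, the same function could be fed from
 gzip.open(path, "rt") instead of the in-memory sample.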

> 2. Keeping flat files up to date is a waste of bandwidth.

 Is there really an alternative, given that you want to keep up to date?
 I know of no standard diff format that would allow us to do so
 incrementally. Virtually every one of the important public databases
 uses different formats and conventions. If such a format exists,
 please do let us know.

> 3. The annotation really needs to be in some kind of database such as
> SOURCE, Matchminer, DAVID or whatever with indexes on each field so that
> searches can complete in a reasonable period of time.

  Yes, and you can easily do that locally, if that is what you want,
  or do it over the net. The advantage of a local database is that you
  have faster access and you can tailor the database to your needs.
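
  A local indexed table of this kind can be sketched with SQLite. This
  is only an illustration under stated assumptions: the table name,
  column names, helper function and rows below are hypothetical, and
  the rows are synthetic rather than real annotation.

```python
import sqlite3

# In-memory database for the sketch; a file path would make it persistent.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotation (genbank TEXT, unigene TEXT, locuslink TEXT)"
)

# Synthetic rows. The last one has no LocusLink entry, which is harmless
# here because GenBank maps to UniGene directly rather than via LocusLink.
rows = [
    ("XX000001", "Hs.0001", "LL1"),
    ("XX000002", "Hs.0001", "LL1"),
    ("XX000003", "Hs.0002", None),
]
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", rows)

# One index per searchable field, so lookups avoid a full table scan.
conn.execute("CREATE INDEX idx_genbank ON annotation (genbank)")
conn.execute("CREATE INDEX idx_unigene ON annotation (unigene)")
conn.execute("CREATE INDEX idx_locuslink ON annotation (locuslink)")
conn.commit()


def unigene_for(conn, accessions):
    """Return one UniGene ID (or None) per GenBank accession, in order."""
    out = []
    for acc in accessions:
        row = conn.execute(
            "SELECT unigene FROM annotation WHERE genbank = ?", (acc,)
        ).fetchone()
        out.append(row[0] if row else None)
    return out


mapped = unigene_for(conn, ["XX000003", "XX000001", "XX999999"])
print(mapped)
```

  Tailoring the database to local needs then amounts to choosing which
  columns to load and which of them to index.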

  Another option would be to treat these as web services. I do not
  think that they support it, although your comments below suggest
  that they might; my scanning of the relevant webpages turned up no
  clear callable interface, but I certainly could have missed something.
  If one exists then this can be made very simple using the XML
  packages and R's connections (no need for Java, nor any need to
  exclude it either, if it is your favorite language).

> 4. HTML based tools are handy for small searches but useless if you want to
> perform searches with a large number of terms where you expect to get back
> parseable data.

  Yes, XML is preferable and many of these DBs could provide it with
  little extra effort - but I think we need to start asking them to do
  so.

> 5. Many Genbank Accession numbers (ESTs in particular) don't map to
> Locuslink therefore going from Accession number to Locuslink to Unigene
> simply doesn't work, e.g. AA683077.

  A very good point.

> Matchminer works for me because I'm calling Rserve and Matchminer from Java,
> the response is relatively quick, and I don't have to worry about keeping
> the data current.

  Yes, but you do have to worry about repeatability (if they update
  between queries). Do they always tell you, and can you determine
  which actual data resources they used? I'm not saying you cannot,
  just raising one of the points of difference between a locally
  amalgamated and managed meta-data resource and an on-line one. There
  are good points for both (and bad points for both).
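
  One way to make a locally managed resource repeatable is to store a
  small provenance record with each snapshot, so that two analyses can
  be checked against the same data. The sketch below is illustrative
  only: the function name, record fields and byte strings are
  assumptions, not part of any existing package.

```python
import hashlib
import json
from datetime import datetime, timezone


def snapshot_record(name, payload, retrieved_at=None):
    """Describe one downloaded data snapshot by checksum, size and time."""
    if retrieved_at is None:
        retrieved_at = datetime.now(timezone.utc)
    return {
        "source": name,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "bytes": len(payload),
        "retrieved_utc": retrieved_at.isoformat(),
    }


# Two made-up snapshots of the same hypothetical source, taken at
# different times; the second has gained a record in between.
old = snapshot_record("example-flatfile", b"ID Hs.0001\n//\n")
new = snapshot_record("example-flatfile", b"ID Hs.0001\nID Hs.0002\n//\n")

# Differing checksums show that the two analyses saw different data.
same_data = old["sha256"] == new["sha256"]
print(json.dumps(old, indent=2))
print("same data:", same_data)
```

  With an on-line service the equivalent information has to come from
  the provider, which is the point of difference raised above.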

  Doing your own amalgamation allows for more control over how
  disparate data sources get merged (and for some folks that is

  Thanks for the interesting comments,

> Dave.
> -----Original Message-----
> From: Gordon Smyth [mailto:smyth at wehi.edu.au] 
> Sent: Thursday, April 15, 2004 8:48 PM
> To: rossini at u.washington.edu; James MacDonald; Dave Waddell; Jean Yee Hwa
> Yang
> Subject: RE: [BioC] Genbank to Unigene IDs
> Dear Jean, Tony, James and Dave,
> Many thanks for your very helpful replies. Just to re-iterate, my interest 
> was to map from GenBank to UniGene IDs within R, i.e., write a function 
> that will take a character vector or list of GenBank IDs and will return 
> the corresponding vector or list of UniGene IDs.
>   If one ignores R, the easiest way that I know of to map GenBank to 
> UniGene IDs is to download Hs.data.gz, and to grep or otherwise search for 
> the GenBank IDs as text strings. (My lab keeps a mirror of the usual 
> databases, so downloading isn't actually required if the code is to be used 
> within my own lab.)
> As far as R is concerned, you've described a number of methods by which 
> the job could be done in principle, but no one has shown actual code to 
> answer my example question, "What's Unigene for GB=NM_004551?" Would it be 
> a fair statement to say that there isn't a reasonably easy way to do the 
> job using Bioconductor, and I would be better to stick to the download and 
> grep idea (which of course could be done within R if need be)?
> Cheers
> Gordon
> PS. There seems no way to use AnnBuilder in R 1.9.0 for Windows. Amongst 
> other problems, AnnBuilder won't load without the XML package, and that 
> package is not available for R 1.9.0 under Windows.
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor

| Robert Gentleman                 phone : (617) 632-5250                   |
| Associate Professor              fax:   (617)  632-2444                   |
| Department of Biostatistics      office: M1B20                            |
| Harvard School of Public Health  email: rgentlem at jimmy.harvard.edu        |
