[BioC] Genbank to Unigene IDs
dwaddell at nutecsciences.com
Mon Apr 19 22:17:24 CEST 2004
There are other issues as well, such as licensing:
There are no restrictions on its use by non-profit institutions as long as
its content is in no way modified and this statement is not removed. Usage
by and for commercial entities requires a license agreement (See
http://www.isb-sib.ch/announce/ or send an email to license at isb-sib.ch ).
and for GOMiner/MatchMiner Barry Zeeberg [zeebergb at mail.nih.gov] says:
Unofficially, pending any corrections from David Kane, as far as I know,
there are no restrictions on either. At the moment, neither is available as
open source, and we are engaged internally in making a decision about this
issue. Both programs have command line interfaces, which allow a great deal
of flexibility in incorporating them in your own custom data processing
stream. There is no restriction whatever on how you choose to do so. Our
basic idea was to make these as freely available as possible, without even
requiring free registration, to lower the barrier to someone using it. There
are frequent updates, as we either fix a problem, add a feature, or make
changes required by changes in external databases from which these programs
draw information, so it is advisable to be on our email list to be kept up
to date.
This is an important issue, for me at least, as we annotate Microarrays to
GO (and many other databases). IMHO, to have one of these databases
available from within Bioconductor would greatly increase its value as a
tool to carry out a complete analysis.
A single authoritative database, maintained by a competent organization and
providing consistent results, could reduce the
requirement for downloading flat files. MatchMiner is not 100% reliable
right now as can be seen in the output from one of the earlier posts in this
thread but with a little effort (assuming they go open source) this could be
fixed. XML output would definitely be a boon.
From: Robert Gentleman [mailto:rgentlem at jimmy.harvard.edu]
Sent: Monday, April 19, 2004 1:23 PM
To: Dave Waddell
Subject: Re: [BioC] Genbank to Unigene IDs
On Fri, Apr 16, 2004 at 02:53:18PM -0500, Dave Waddell wrote:
> There are a number of problems in all of the solutions proposed.
> 1. Flat files like Hs are huge and grepping them takes forever.
Yes, but I don't think that anyone is doing that for a production
system (for one off, it may in fact be more efficient depending on
how you measure efficiency).
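For the one-off case, a single pass over the flat file can be sketched as below. This is only an illustration under stated assumptions: it assumes each record has an "ID" line, "SEQUENCE" lines carrying "ACC=<accession>" fields, and a closing "//" line. The function names are hypothetical, and the cluster ID in the demo is a placeholder, not the real answer for NM_004551.

```python
import gzip


def scan_unigene(lines, wanted):
    """Return {accession: cluster ID} for the accessions listed in `wanted`.

    Assumed record layout (not confirmed by the thread): an 'ID' line,
    'SEQUENCE' lines with 'ACC=<accession>' fields, and a closing '//'.
    """
    wanted = set(wanted)
    found = {}
    cluster = None
    for line in lines:
        if line.startswith("ID "):
            cluster = line.split()[1]
        elif line.startswith("SEQUENCE") and cluster is not None:
            for field in line.split()[1:]:
                if field.startswith("ACC="):
                    # Drop the trailing ';' and any '.version' suffix.
                    acc = field[4:].rstrip(";").split(".")[0]
                    if acc in wanted:
                        found[acc] = cluster
        elif line.startswith("//"):
            cluster = None
    return found


def scan_unigene_file(path, wanted):
    """Same scan, reading a gzipped flat file such as Hs.data.gz."""
    with gzip.open(path, "rt") as handle:
        return scan_unigene(handle, wanted)


# Demo on an in-memory record; 'Hs.EXAMPLE1' is a made-up placeholder.
demo_lines = [
    "ID          Hs.EXAMPLE1\n",
    "TITLE       placeholder record\n",
    "SEQUENCE    ACC=NM_004551.1; NID=g0; PID=g0\n",
    "//\n",
]
demo_result = scan_unigene(demo_lines, ["NM_004551", "AA683077"])
print(demo_result)
```

Accessions that never appear in the file are simply absent from the result, so the caller can tell "not found" apart from a match.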
> 2. Keeping flat files up to date is a waste of bandwidth.
Is there really an option, given that you want to keep up to date?
I know of no standard diff format that would allow us to keep up to
date. Virtually every one of the important public databases uses
different formats and conventions. But if so, please do let us know.
> 3. The annotation really needs to be in some kind of database such as
> SOURCE, Matchminer, DAVID or whatever with indexes on each field so that
> searches can complete in a reasonable period of time.
Yes, and you can easily do that locally - if that is what you want
or do it over the net. The advantage to local is that you have
faster access and you can tailor the database to your needs.
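A local indexed store of this kind might look like the sketch below, using SQLite from the Python standard library. The table name, column names, and function names are assumptions chosen for illustration, and the cluster IDs are placeholders rather than real mappings.

```python
import sqlite3


def build_index(pairs, db_path=":memory:"):
    """Load (accession, cluster) pairs and index both columns."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gb2ug (accession TEXT, cluster TEXT)"
    )
    conn.executemany("INSERT INTO gb2ug VALUES (?, ?)", pairs)
    # Indexes on each field so lookups in either direction avoid a full scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_acc ON gb2ug (accession)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_cluster ON gb2ug (cluster)")
    conn.commit()
    return conn


def lookup(conn, accessions):
    """Return one cluster ID (or None) per accession, in input order."""
    out = []
    for acc in accessions:
        row = conn.execute(
            "SELECT cluster FROM gb2ug WHERE accession = ?", (acc,)
        ).fetchone()
        out.append(row[0] if row else None)
    return out


# Placeholder data: these cluster IDs are invented for the demo.
demo_conn = build_index(
    [("NM_004551", "Hs.EXAMPLE1"), ("AA683077", "Hs.EXAMPLE2")]
)
demo_hits = lookup(demo_conn, ["NM_004551", "AA683077", "ZZ000000"])
print(demo_hits)
```

Returning a list aligned with the input keeps the vector-in, vector-out shape asked for earlier in the thread, with None marking accessions that have no entry.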
Another option would be to treat these as web services (I do not
think that they support it, although your comments below suggest that
they might; my scanning of the relevant webpages turned up no clear
callable interface, but I certainly could have missed something).
If one exists then this can be made very simple using the XML
packages and R's connections (no need for Java, nor any need to
exclude it either - if it is your favorite language).
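If such an interface existed, consuming its XML could be as short as the sketch below. The response layout (a "results" root with "mapping" elements and "accession" / "unigene" attributes) is entirely hypothetical. It does not describe MatchMiner, SOURCE, DAVID, or any real service, and the cluster ID is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; a real service would define its own schema.
SAMPLE_RESPONSE = """<results>
  <mapping accession="NM_004551" unigene="Hs.EXAMPLE1"/>
  <mapping accession="AA683077" unigene=""/>
</results>"""


def parse_mappings(xml_text):
    """Return {accession: cluster ID or None} from the assumed XML layout."""
    root = ET.fromstring(xml_text)
    return {
        node.get("accession"): (node.get("unigene") or None)
        for node in root.iter("mapping")
    }


demo_mappings = parse_mappings(SAMPLE_RESPONSE)
print(demo_mappings)
```

Fetching the text over HTTP is left out on purpose, since no callable endpoint is known here; only the parsing step is shown.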
> 4. HTML based tools are handy for small searches but useless if you want
> to perform searches with a large number of terms where you expect to get
> back parseable data.
Yes, XML is preferable and many of these DBs could provide it with
little extra effort - but I think we need to start asking them to do so.
> 5. Many Genbank Accession numbers (ESTs in particular) don't map to
> Locuslink therefore going from Accession number to Locuslink to Unigene
> simply doesn't work i.e. AA683077.
A very good point.
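The difference between the two routes can be illustrated with toy lookup tables. Everything below is placeholder data except the two accessions quoted in this thread. The LocusLink and UniGene IDs are invented, and the EST is left out of the accession-to-LocusLink table only to mirror the claim above.

```python
def via_locuslink(acc, acc_to_ll, ll_to_ug):
    """Two-hop route: accession -> LocusLink -> UniGene (None if a hop fails)."""
    ll = acc_to_ll.get(acc)
    return ll_to_ug.get(ll) if ll is not None else None


def direct(acc, acc_to_ug):
    """One-hop route: accession -> UniGene."""
    return acc_to_ug.get(acc)


# Placeholder tables; the EST has no LocusLink entry, as described above.
acc_to_ll = {"NM_004551": "LL_EXAMPLE"}
ll_to_ug = {"LL_EXAMPLE": "Hs.EXAMPLE1"}
acc_to_ug = {"NM_004551": "Hs.EXAMPLE1", "AA683077": "Hs.EXAMPLE2"}

for acc in ["NM_004551", "AA683077"]:
    print(acc, via_locuslink(acc, acc_to_ll, ll_to_ug), direct(acc, acc_to_ug))
```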
> Matchminer works for me because I'm calling Rserve and Matchminer from
> the response is relatively quick, and I don't have to worry about keeping
> the data current.
Yes, but you do have to worry about repeatability (if they update
between queries). Do they always tell you, and can you determine
which actual data resources they used? I'm not saying you cannot,
just raising one of the points of difference between a locally
amalgamated and managed meta-data resource and an on-line one. There
are good points for both (and bad points for both).
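One way to keep results repeatable with either approach is to store each batch of answers alongside a fingerprint of the data that produced it. The sketch below is an assumption about how that might be done: the function name and record fields are hypothetical, and the byte strings stand in for a downloaded file or a saved service response.

```python
import hashlib
from datetime import datetime, timezone


def provenance_record(data_bytes, source_name, results):
    """Bundle query results with a checksum and timestamp of the source data."""
    return {
        "source": source_name,
        "sha256": hashlib.sha256(data_bytes).hexdigest(),
        "retrieved_utc": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }


# Placeholder inputs standing in for two snapshots of the same resource.
snapshot_a = b"snapshot contents, first download"
snapshot_b = b"snapshot contents, after an update"

rec_a = provenance_record(snapshot_a, "local mirror", {"NM_004551": "Hs.EXAMPLE1"})
rec_b = provenance_record(snapshot_b, "local mirror", {"NM_004551": "Hs.EXAMPLE1"})

# Differing checksums show that the two result sets came from different data.
print(rec_a["sha256"] == rec_b["sha256"])
```

With an on-line resource the same idea only works if the service reports which data release it used, which is the question raised above.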
Doing your own amalgamation allows for more control over how
disparate data sources get merged (and for some folks that is
Thanks for the interesting comments,
> -----Original Message-----
> From: Gordon Smyth [mailto:smyth at wehi.edu.au]
> Sent: Thursday, April 15, 2004 8:48 PM
> To: rossini at u.washington.edu; James MacDonald; Dave Waddell; Jean Yee
> Subject: RE: [BioC] Genbank to Unigene IDs
> Dear Jean, Tony, James and Dave,
> Many thanks for your very helpful replies. Just to re-iterate, my interest
> was to map from GenBank to UniGene IDs within R, i.e., write a function
> that will take a character vector or list of GenBank IDs and will return
> the corresponding vector or list of UniGene IDs.
> If one ignores R, the easiest way that I know of to map GenBank to
> UniGene IDs is to download Hs.data.gz, and to grep or otherwise search for
> the GenBank IDs as text strings. (My lab keeps a mirror of the usual
> databases, so downloading isn't actually required if the code is to be
> within my own lab.)
> As far as R is concerned, you've described a number of methods by which
> the job could be done in principle, but no one has shown actual code to
> answer my example question, "What's Unigene for GB=NM_004551?" Would it be
> a fair statement to say that there isn't a reasonably easy way to do the
> job using Bioconductor, and I would be better to stick to the download and
> grep idea (which of course could be done within R if need be)?
> PS. There seems no way to use AnnBuilder in R 1.9.0 for Windows. Amongst
> other problems, AnnBuilder won't load without the XML package, and that
> package is not available for R 1.9.0 under Windows.
| Robert Gentleman phone : (617) 632-5250
| Associate Professor fax: (617) 632-2444
| Department of Biostatistics office: M1B20
| Harvard School of Public Health email: rgentlem at jimmy.harvard.edu