[BioC] Genbank to Unigene IDs

Fri Apr 23 17:52:52 CEST 2004

Dave,

We have discussed your suggestion about an XML interface, and we would be
interested in including one.  In fact, that feature had been on our queue of
possible features for some time, but we did not have an obvious consumer for
the interface.  I have a couple questions.

1) In reading the thread below, it sounds as if there is more interest in
the Lookup interface than the Merge interface.  Is that correct?

2) What sort of usage level do you expect?  I know Bioconductor is very
popular, and I want to make sure that if we commit to providing a service
for use from Bioconductor that we are able to meet the expected level of
usage.

Sincerely,
David Kane

P.S. For those of you who have been cc'd on this note, but not on the other
messages between Dave and me, I believe the issue of the connection problem
that Dave alluded to in the note below has been resolved in the latest build
that is on our web site.  If there are other issues, please let us know.

-----Original Message-----
From: Barry Zeeberg [mailto:zeebergb at mail.nih.gov] 
Sent: Monday, April 19, 2004 4:38 PM
To: Dave Waddell; Bioconductor
Cc: Kane, David; Bussey, Kimberly (NIH/NCI); John N. Weinstein
Subject: Re: [BioC] Genbank to Unigene IDs

We are very interested in participating with either for-profit or
not-for-profit organizations, and feedback on what would be helpful would be
fed into our workflow.

Any problems with matchminer or gominer are of concern to us, and we
prioritize correcting these. In addition to the concrete suggestion of XML
output, could you elaborate on the matchminer unreliability issue? It is
possible that we have fixed this already in a not-yet-released update, but we
would like to track and correct any residual problems.

There is a great emphasis now at NIH on technology transfer, and we could
all benefit from the successful use of one of our resources in your product.

barry

On 04/19/04 16:17, "Dave Waddell" <dwaddell at nutecsciences.com> wrote:

> There are other issues as well i.e. licensing:
> For DAVID:
> http://david.niaid.nih.gov/david/ease.htm
> 
> For SOURCE:
> There are no restrictions on its use by non-profit institutions as 
> long as its content is in no way modified and this statement is not 
> removed. Usage by and for commercial entities requires a license 
> agreement (See http://www.isb-sib.ch/announce/ or send an email to 
> license at isb-sib.ch ).
> 
> and for GOMiner/MatchMiner Barry Zeeberg [zeebergb at mail.nih.gov] says: 
> Unofficially, pending any corrections from David Kane, as far as I 
> know, there are no restrictions on either. At the moment, neither is 
> available as open source, and we are engaged internally in making a 
> decision about this issue. Both programs have command line interfaces, 
> which allow a great deal of flexibility in incorporating them in your 
> own custom data processing stream. There is no restriction whatever on 
> how you choose to do so. Our basic idea was to make these as freely 
> available as possible, without even requiring free registration, to 
> lower the barrier to someone using it. There are frequent updates, as 
> we either fix a problem, add a feature, or make changes required by 
> changes in external databases from which these programs draw 
> information, so it is advisable to be on our email list to be kept up 
> to date.
> 
> This is an important issue, for me at least, as we annotate 
> Microarrays to GO (and many other databases). IMHO, to have one of 
> these databases available from within Bioconductor would greatly 
> increase its value as a tool to carry out a complete analysis.
> 
> A single authoritative database that consistently provided results 
> and was maintained by a competent organization could reduce the 
> requirement for downloading flat files. MatchMiner is not 100% 
> reliable right now, as can be seen in the output from one of the 
> earlier posts in this thread, but with a little effort (assuming they 
> go open source) this could be fixed. XML output would definitely be a 
> boon.
> 
> Dave.
> 
> -----Original Message-----
> From: Robert Gentleman [mailto:rgentlem at jimmy.harvard.edu]
> Sent: Monday, April 19, 2004 1:23 PM
> To: Dave Waddell
> Cc: Bioconductor
> Subject: Re: [BioC] Genbank to Unigene IDs
> 
> On Fri, Apr 16, 2004 at 02:53:18PM -0500, Dave Waddell wrote:
>> There are a number of problems in all of the solutions proposed.
>> 1. Flat files like Hs are huge and grepping them takes forever.
> 
> Yes, but I don't think that anyone is doing that for a production 
> system (for a one-off, it may in fact be more efficient, depending on 
> how you measure efficiency).
> 
>> 2. Keeping flat files up to date is a waste of bandwidth.
> 
> Is there really an option, given that you want to keep up to date? I 
> know of no standard diff format that would allow us to keep up to 
> date. Virtually every one of the important public databases uses 
> different formats and conventions. But if so, please do let us know.
> 
> 
>> 3. The annotation really needs to be in some kind of database such as 
>> SOURCE, Matchminer, DAVID or whatever with indexes on each field so 
>> that searches can complete in a reasonable period of time.
> 
> Yes, and you can easily do that locally - if that is what you want or 
> do it over the net. The advantage to local is that you have faster 
> access and you can tailor the database to your needs.
> 
> Another option would be to treat these as web services (I do not 
> think that they support it, although your comments below suggest that 
> they might. My scanning of the relevant webpages turned up no clear 
> callable interface, but I certainly could have missed something). If 
> one exists then this can be made very simple using the XML packages 
> and R's connections (no need for Java, nor any need to exclude it 
> either - if it is your favorite language).
> 
>> 4. HTML based tools are handy for small searches but useless if you 
>> want to perform searches with a large number of terms where you 
>> expect to get back parseable data.
> 
> Yes, XML is preferable and many of these DBs could provide it with 
> little extra effort - but I think we need to start asking them to do 
> so.
> 
> 
>> 5. Many Genbank Accession numbers (ESTs in particular) don't map to 
>> Locuslink, so going from Accession number to Locuslink to Unigene 
>> simply doesn't work, e.g. for AA683077.
> 
> A very good point.
> 
>> 
>> Matchminer works for me because I'm calling Rserve and Matchminer 
>> from Java, the response is relatively quick, and I don't have to 
>> worry about keeping the data current.
> 
> Yes, but you do have to worry about repeatability (if they update 
> between queries). Do they always tell you, and can you determine which 
> actual data resources they used? I'm not saying you cannot, just 
> raising one of the points of difference between a locally amalgamated 
> and managed meta-data resource and an on-line one. There are good 
> points for both (and bad points for both).
> 
> Doing your own amalgamation allows for more control over how disparate 
> data sources get merged (and for some folks that is important).
> 
> Thanks for the interesting comments,
>   Robert
> 
> 
>> Dave.
>> 
>> -----Original Message-----
>> From: Gordon Smyth [mailto:smyth at wehi.edu.au]
>> Sent: Thursday, April 15, 2004 8:48 PM
>> To: rossini at u.washington.edu; James MacDonald; Dave Waddell; 
>> Jean Yee Hwa Yang
>> Subject: RE: [BioC] Genbank to Unigene IDs
>> 
>> Dear Jean, Tony, James and Dave,
>> 
>> Many thanks for your very helpful replies. Just to reiterate, my 
>> interest was to map from GenBank to UniGene IDs within R, i.e., write 
>> a function that will take a character vector or list of GenBank IDs 
>> and will return the corresponding vector or list of UniGene IDs.
>> 
>> If one ignores R, the easiest way that I know of to map GenBank to 
>> UniGene IDs is to download Hs.data.gz, and to grep or otherwise 
>> search for the GenBank IDs as text strings. (My lab keeps a mirror of 
>> the usual databases, so downloading isn't actually required if the 
>> code is to be used within my own lab.)
>> 
>> As far as R is concerned, you've described a number of methods by 
>> which the job could be done in principle, but no one has shown actual 
>> code to answer my example question, "What's Unigene for 
>> GB=NM_004551?" Would it be a fair statement to say that there isn't 
>> a reasonably easy way to do the job using Bioconductor, and that I 
>> would be better off sticking to the download and grep idea (which of 
>> course could be done within R if need be)?
>> 
>> Cheers
>> Gordon
>> 
>> PS. There seems no way to use AnnBuilder in R 1.9.0 for Windows. 
>> Amongst other problems, AnnBuilder won't load without the XML 
>> package, and that package is not available for R 1.9.0 under Windows.
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch 
>> https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
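
The download-and-search idea from Gordon's message above can be sketched in a few lines. The example below uses Python rather than R purely for illustration. It assumes the UniGene flat-file layout in which each record starts with an "ID" line, lists its member sequences on "SEQUENCE" lines carrying "ACC=..." tokens, and ends with "//". The function names (build_index, genbank_to_unigene) and the sample records are made up for this sketch; the IDs in SAMPLE are not real UniGene or GenBank entries.

```python
import io
import re

# Accession token on a SEQUENCE line, e.g. "ACC=XM_900001.1;".
# The optional ".N" version suffix is left out of the captured ID.
ACC_RE = re.compile(r"ACC=([A-Za-z0-9_]+)(?:\.\d+)?")


def build_index(lines):
    """Scan flat-file lines once and map accession -> cluster ID.

    Assumed layout (not verified against any particular release):
    "ID  Hs.N" opens a record, "SEQUENCE  ACC=...; ..." lists a member
    sequence, and "//" closes the record. If an accession appears in
    more than one record, the last one seen wins.
    """
    index = {}
    cluster = None
    for line in lines:
        if line.startswith("ID "):
            cluster = line.split()[1]
        elif line.startswith("SEQUENCE") and cluster is not None:
            match = ACC_RE.search(line)
            if match:
                index[match.group(1)] = cluster
        elif line.startswith("//"):
            cluster = None
    return index


def genbank_to_unigene(ids, index):
    """Return the cluster ID for each accession, or None if unmapped."""
    return [index.get(acc) for acc in ids]


# Made-up records for illustration only; these are not real entries.
SAMPLE = """\
ID          Hs.900001
TITLE       hypothetical cluster A
SEQUENCE    ACC=XM_900001.1; NID=g1; SEQTYPE=mRNA
SEQUENCE    ACC=ZZ900002; NID=g2; SEQTYPE=EST
//
ID          Hs.900002
TITLE       hypothetical cluster B
SEQUENCE    ACC=ZZ900003.2; NID=g3; SEQTYPE=EST
//
"""

index = build_index(io.StringIO(SAMPLE))
result = genbank_to_unigene(["XM_900001", "ZZ900003", "NOPE"], index)
print(result)
```

Building the index once and then looking up a whole vector of IDs avoids re-scanning the flat file for every query, which should ease the "grepping takes forever" concern when many IDs are mapped. For a real download, a file object such as gzip.open("Hs.data.gz", "rt") could be passed to build_index in place of the in-memory sample.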