[BioC] Bioconductor data sets - Unigene IDs

Robert Gentleman rgentlem@jimmy.harvard.edu
Wed, 2 Oct 2002 17:51:30 -0400


Funny that you should mention that: in the soon-to-be-released, new and
improved builds you will get just that (and much more).
For example, in the hgu95a package (up later this week),
I get:
UniGene Date built: Build #155.<URL:
     ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz>.

I agree completely that this is what we want. And we have done some
work to ensure that all data sources are adequately documented. 
We (Jianhua and I) would like to hear from others with similar
concerns, if the new format does not address the issues appropriately.
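
A minimal sketch of the consistency check requested in the quoted message,
written in Python purely for illustration. It assumes each data package
exposes a build string shaped like the "Build #155" line above; the
BUILD_INFO dictionary, the hgu133a value, and both function names are
hypothetical and not part of any Bioconductor package.

```python
import re

# Hypothetical build strings, shaped like the "Date built" line above.
# The hgu133a entry is invented for illustration only.
BUILD_INFO = {
    "hgu95a": "Build #155.",
    "hgu133a": "Build #155.",
}


def unigene_build(text):
    """Return the build number found in a 'Build #NNN' string, or None."""
    match = re.search(r"Build\s*#(\d+)", text)
    return int(match.group(1)) if match else None


def same_build(build_info):
    """True only if every package reports the same, known build number."""
    builds = {unigene_build(text) for text in build_info.values()}
    return len(builds) == 1 and None not in builds
```

Software that combines UniGene IDs from several packages could call
same_build(BUILD_INFO) first and refuse to proceed when it returns False.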

Robert

On Wed, Oct 02, 2002 at 11:33:00PM +0200, Wolfgang Huber wrote:
> Hi,
> 
> I am sending this through the mailing list since it may be of general interest. On
> the bioconductor page, there are a number of data packages, such as hgu95a,
> hgu133a. Among other goodies, they contain XDR files with environments full
> of mappings from Affymetrix probe set identifiers to Unigene cluster IDs.
> For example, these may be used to figure out which probe sets from the same,
> or from different chips, represent a given gene.
> 
> Now here's a quote from the NCBI web page:
> "... Since the sequences which make up a cluster may change from week to
> week, and since the cluster identifier may disappear (typically when two
> clusters merge) using the cluster identifier as a reference is ill-advised.
> Using the GB accession numbers of the sequences which comprise the cluster
> is a safe alternative."
> (from http://www.ncbi.nlm.nih.gov/UniGene/build.shtml)
> 
> From this it sounds as if clusters only ever merge, but from what I
> understand there are also situations in which they may split. Hence, Unigene
> cluster IDs are very useful intermediate IDs for processing data at a given
> time, but they cannot be used as persistent identifiers. And whenever one
> does such processing, one has to make sure that all Unigene-IDs involved
> refer to the same Unigene-Build. Now, to come to the point, I haven't seen
> that information (about the Unigene build) in the data packages on
> Bioconductor, and I would like to suggest including that information in the
> package data, and documenting how to find the information in a prominent
> place, so software using these packages can make sure the Unigene IDs are
> consistent.
> 
> Best regards
> Wolfgang.
> 
> Dr. Wolfgang Huber
> http://www.dkfz.de/abt0840/whuber
> Tel +49-6221-424709
> Fax +49-6221-42524709
> DKFZ
> Division of Molecular Genome Analysis
> 69120 Heidelberg
> Germany
> 

-- 
+---------------------------------------------------------------------------+
| Robert Gentleman                 phone : (617) 632-5250                   |
| Associate Professor              fax:   (617)  632-2444                   |
| Department of Biostatistics      office: M1B20                            |
| Harvard School of Public Health  email: rgentlem@jimmy.dfci.harvard.edu   |
+---------------------------------------------------------------------------+