[BioC] Bioconductor data sets - Unigene IDs

Wed, 2 Oct 2002 23:33:00 +0200

Hi,

I am sending this to the mailing list since it may be of general interest.
On the Bioconductor page, there are a number of data packages, such as hgu95a
and hgu133a. Among other goodies, they contain XDR files with environments
full of mappings from Affymetrix probe set identifiers to Unigene cluster IDs.
For example, these may be used to figure out which probe sets, from the same
or from different chips, represent a given gene.
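
As a rough illustration of that use (written in Python rather than R, and
with made-up placeholder probe set and cluster IDs, not real annotation
data), inverting the probe-set-to-cluster mappings of two chips groups the
probe sets that point to the same cluster. The function name
probes_by_cluster is hypothetical and not part of any Bioconductor package.

```python
from collections import defaultdict

# Hypothetical mappings: probe set ID -> Unigene cluster ID (placeholders only).
hgu95a = {"probeA_at": "Hs.X1", "probeB_at": "Hs.X2", "probeC_at": "Hs.X1"}
hgu133a = {"probeD_at": "Hs.X1", "probeE_at": "Hs.X3"}


def probes_by_cluster(chips):
    """Invert {chip: {probe: cluster}} into {cluster: [(chip, probe), ...]}."""
    grouped = defaultdict(list)
    for chip, mapping in chips.items():
        for probe, cluster in mapping.items():
            if cluster is not None:  # skip probe sets without a cluster ID
                grouped[cluster].append((chip, probe))
    return dict(grouped)


clusters = probes_by_cluster({"hgu95a": hgu95a, "hgu133a": hgu133a})
# All probe sets, on either chip, assigned to the placeholder cluster "Hs.X1".
print(clusters["Hs.X1"])
```

Such a grouping is only meaningful if both mappings were derived from the
same Unigene build.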

Now here's a quote from the NCBI web page:
"... Since the sequences which make up a cluster may change from week to
week, and since the cluster identifier may disappear (typically when two
clusters merge) using the cluster identifier as a reference is ill-advised.
Using the GB accession numbers of the sequences which comprise the cluster
is a safe alternative."
(from http://www.ncbi.nlm.nih.gov/UniGene/build.shtml)

From this it sounds as if clusters only ever merge, but from what I
understand there are also situations in which they may split. Hence, Unigene
cluster IDs are very useful intermediate IDs for processing data at a given
time, but they cannot be used as persistent identifiers. And whenever one
does such processing, one has to make sure that all Unigene IDs involved
refer to the same Unigene build. Now, to come to the point: I haven't seen
that information (the Unigene build) in the data packages on
Bioconductor. I would like to suggest including it in the package data
and documenting in a prominent place how to find it, so that software
using these packages can make sure the Unigene IDs are consistent.
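
To make the suggestion concrete, here is a minimal sketch of the kind of
check software could do if each mapping carried its build. It is in Python
for illustration only; the class AnnotatedMap, its field unigene_build, the
function require_same_build, and the build labels are all assumptions of
mine, not something the current packages provide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedMap:
    """Hypothetical container: a probe set -> Unigene mapping plus its build."""
    chip: str
    unigene_build: str       # placeholder label, e.g. "build-X"
    probe_to_cluster: dict


def require_same_build(maps):
    """Return the common Unigene build, or raise if the maps disagree."""
    builds = {m.unigene_build for m in maps}
    if len(builds) != 1:
        found = ", ".join(sorted(builds)) or "none"
        raise ValueError("expected exactly one Unigene build, found: " + found)
    return builds.pop()


a = AnnotatedMap("hgu95a", "build-X", {"probeA_at": "Hs.X1"})
b = AnnotatedMap("hgu133a", "build-X", {"probeD_at": "Hs.X1"})
c = AnnotatedMap("hgu133a", "build-Y", {"probeD_at": "Hs.X9"})

print(require_same_build([a, b]))   # builds agree, IDs can be compared

try:
    require_same_build([a, c])      # builds differ, so comparison is refused
except ValueError as err:
    print(err)
```

Failing loudly on a mismatch seems safer than silently comparing cluster IDs
that may have been merged or split between builds.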

Best regards
Wolfgang.

Dr. Wolfgang Huber
http://www.dkfz.de/abt0840/whuber
Tel +49-6221-424709
Fax +49-6221-42524709
DKFZ
Division of Molecular Genome Analysis
69120 Heidelberg
Germany