[BioC] GB from a package built with ABPkgBuilder

Sean Davis sdavis2 at mail.nih.gov
Tue Jun 8 03:00:09 CEST 2004


Mayte,

If you want to combine data sets, I would suggest using Unigene, as it
naturally takes GenBank accessions and groups them into clusters based (in a
somewhat algorithmic way) on alignment to the genome.  If one looks at a
Unigene record from Hs.data, it has some specific information about each
cluster (LocusLink ID, name, etc.), but is otherwise a long list of sequence
accession numbers and clone IDs.  AnnBuilder uses these lists to assign
GenBank accession numbers or IMAGE clones to Unigene clusters.

With "old" microarray designs, some (and perhaps many) of the GenBank
sequences may not map to any Unigene cluster, in which case you "lose" that
gene.  Also, some arrays had a Unigene cluster assigned to each feature at
design time.  Unfortunately, those Unigene IDs may or may not be comparable
to Unigene IDs from the current build, so it is probably worthwhile to
re-annotate them based on IMAGE clone or GenBank accession (which serves to
"update" the array features to the "newest" annotation).

All this is a long-winded way of saying that your best bet is probably to
get all of the arrays to Unigene and then map each to the others using
common Unigene IDs.  This is not perfect, but there is currently no perfect
solution to this problem.  If any of your platforms are composed of oligos
rather than cDNA, the problem could be much more involved, but it can be
approached in a similar manner.  One can use AnnBuilder to re-annotate each
platform, or it could be done outside R using Perl and files downloaded
from Unigene itself.  (I would suggest the former.)
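
To make the "outside R" route a little more concrete, here is a minimal
sketch of the idea, written in Python rather than Perl purely for
illustration.  It assumes records shaped roughly like those in Hs.data (an
"ID" line, "SEQUENCE" lines carrying ACC=... fields, and "//" closing each
record); that layout, the helper names, and every ID in the toy data are
assumptions made up for the example, not AnnBuilder's actual procedure.

```python
import re

# Toy text in the general shape of Hs.data records (ASSUMED layout: an "ID"
# line, "SEQUENCE" lines with ACC=... fields, "//" ending each record).
# All cluster IDs and accessions below are invented for illustration.
TOY_HS_DATA = """\
ID          Hs.1001
TITLE       hypothetical gene A
SEQUENCE    ACC=AA000001.1; NID=g1; CLONE=IMAGE:111; SEQTYPE=EST
SEQUENCE    ACC=BC000002.2; NID=g2; SEQTYPE=mRNA
//
ID          Hs.1002
TITLE       hypothetical gene B
SEQUENCE    ACC=AA000003.1; NID=g3; CLONE=IMAGE:333; SEQTYPE=EST
//
"""


def accession_to_cluster(text):
    """Build {accession (version suffix dropped): cluster ID}. Hypothetical helper."""
    mapping = {}
    cluster = None
    for line in text.splitlines():
        if line.startswith("ID "):
            cluster = line.split()[1]
        elif line.startswith("SEQUENCE") and cluster is not None:
            match = re.search(r"ACC=([^;.\s]+)", line)
            if match:
                mapping[match.group(1)] = cluster
        elif line.strip() == "//":
            cluster = None  # record finished
    return mapping


def reannotate(platform, acc2cluster):
    """Turn {feature: accession} into {feature: cluster}; unmapped features are dropped."""
    return {f: acc2cluster[a] for f, a in platform.items() if a in acc2cluster}


def common_clusters(*annotated):
    """Clusters that appear on every re-annotated platform."""
    return set.intersection(*(set(a.values()) for a in annotated))


acc2ug = accession_to_cluster(TOY_HS_DATA)
s1 = {"spot1": "AA000001", "spot2": "AA000003", "spot3": "ZZ999999"}
s2 = {"probeA": "BC000002"}
s1_ug = reannotate(s1, acc2ug)  # spot3 has no cluster and is "lost"
s2_ug = reannotate(s2, acc2ug)
print(common_clusters(s1_ug, s2_ug))  # {'Hs.1001'}
```

Note that spot3 drops out because its accession is in no cluster, which is
the "lost gene" case described above.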

As for why this can't be done with GB IDs, there are MILLIONS of possible
GenBank IDs, some of which represent truly IDENTICAL sequences.  However,
there is no way to tell how similar two GB IDs should be (in terms of
expression behavior) without further processing them, which is exactly
what Unigene is designed to do.  Two platforms can therefore represent the
same gene with completely different accessions, which would explain the
small GB intersection you are seeing.
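
A toy illustration of this, with an invented accession-to-cluster table and
invented platform contents (none of these IDs are real): two platforms that
cover the same two genes share no accessions at all, yet agree completely
once each accession is replaced by its cluster.

```python
# Hypothetical accession -> Unigene cluster table; every ID is made up.
ACC_TO_UG = {
    "AA000001": "Hs.1001",
    "BC000002": "Hs.1001",
    "AA000003": "Hs.1002",
    "NM_000004": "Hs.1002",
}

# Two hypothetical platforms covering the same genes via different sequences.
S1 = ["AA000001", "AA000003"]
S2 = ["BC000002", "NM_000004"]


def shared_by_accession(a, b):
    """Intersection on raw GenBank accessions."""
    return set(a) & set(b)


def shared_by_cluster(a, b, table):
    """Intersection after mapping each accession to its cluster."""
    def clusters(accessions):
        return {table[x] for x in accessions if x in table}
    return clusters(a) & clusters(b)


print(len(shared_by_accession(S1, S2)))               # 0
print(sorted(shared_by_cluster(S1, S2, ACC_TO_UG)))   # ['Hs.1001', 'Hs.1002']
```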

Hope this helps.

Sean
----- Original Message -----
From: "Mayte Suarez-Farinas" <mayte at babel.rockefeller.edu>
To: "Sean Davis" <sdavis2 at mail.nih.gov>
Sent: Monday, June 07, 2004 6:56 PM
Subject: Re: [BioC] GB from a package built with ABPkgBuilder


> On Mon, 7 Jun 2004, Sean Davis wrote:
>
> Sean,
> Thank you for your answer. I will explain why I need this, because
> I need some advice.
>
> I was trying to use a multi-study approach as in the MergeMaid software
> (Parmigiani's recent paper and software). Let's say that I have 3 studies.
> One of them is from the SMD database (let's call it S1). The SMD data comes
> with annotation included, so I used that annotation. I initially wanted
> to work with GB as the key for the 3 studies in order to have more genes
> (at some point the software takes a mean of the measurements with the same
> ID; that can be avoided, but I first tried to use GB).
> The GB ID intersection for studies S2 and S3 is OK, a reasonable number,
> but when I intersected S1 with either S2 or S3 (using GB),
> the intersection was fewer than 500 genes (out of 43000 spots in S1).
> However, if I make the intersection with UG I obtain, say, 9000. It seems
> as if, for the same UG, S1 and S2 (or S3) got completely different GB IDs!
> Is there some reason for that? (I am not very familiar with GB and
> annotation details.)
>
> Then I decided to create an annotation package for S1 using ABPkgBuilder as
> I usually do. I did it using UG as the key and also using the IMAGE ID.
> With the IMAGE ID I got a very poor annotation,
> so I just have the annotation with UG.
>
> Any suggestions, comments or advice? I would really appreciate it.
>
> P.S. Everything I did used the latest version of everything, with
> annotations updated every week.
>
> > Mayte,
> >
> > Unfortunately, Unigene is a method for "collapsing" perhaps hundreds of
> > GenBank accession numbers into a single "Unigene Cluster".  As such, a
> > Unigene may represent hundreds of GenBank sequences.  It is not very
> > meaningful to get a GenBank sequence from this.  There are other options.
> > First, you can use the RefSeq sequence for those that have one (this
> > probably makes the most sense, but you would have to think about what
> > your use will be).  Second, you could go outside R and collect the "best"
> > Unigene sequence from NCBI (they maintain a file that maps each UG ID to
> > the GenBank accession of the "best" GenBank entry representing the
> > Unigene).  Third, you could use the Hs.data file to get ALL the GenBank
> > accessions associated with the UG.  (The ACCNUM environment usually
> > contains only those listed in LocusLink for the gene, if I'm not
> > mistaken.)  There are many other options.
> > Why do you need them, if I might ask?
> >
> > I'm not sure why you are getting back the UG again when getting from the
> > ACCNUM environment.
> >
> > Sean
> >
> > ----- Original Message -----
> > From: "Mayte Suarez-Farinas" <mayte at babel.rockefeller.edu>
> > To: <bioconductor at stat.math.ethz.ch>
> > Sent: Monday, June 07, 2004 6:13 PM
> > Subject: [BioC] GB from a package built with ABPkgBuilder
> >
> >
> > >
> > > A "simple" question:
> > >
> > > I built an annotation package using UNIGENE IDs as the key.
> > > Now I want to get the GB annotation for a list of genes.
> > > How can I do that? I used mget with the env nameACCNUM but I obtained
> > > the UG again.
> > >
> > > Thanks
> > > --
> > > Mayte Suarez Farinas
> > > The Rockefeller University
> > > 1230 York Avenue, Box 212
> > > New York, NY 10021
> > > phone: 1-212-327-8186
> > > fax:   1-212-327-7422
> > >
> > >
> >
> >
>
>
>
>


