[BioC] Genbank to Unigene IDs

Sat Apr 17 08:49:01 CEST 2004

Dear John,

Thanks for your suggestion. I can see the attraction of going through 
LocusLink because the LocusLink files are relatively small. But the fact 
that LocusLink is only a subset of GenBank (as pointed out by Dave Waddell) 
seems disastrous. I tried your code on a set of GenBank IDs from a human 
oligo array based on the Compugen 19k library. The code found LocusLink IDs 
for only 4587 of the Genbank IDs. Meanwhile, SOURCE found Unigene IDs for 
16230 of them. So going through LocusLink found the UniGene ID in less than 
30% of cases in which there was one to find.
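
A rough sketch of this kind of coverage check, in case it is useful. It runs on toy data only: the `lookup` table, the `coverage` helper and the "AA00000x" accessions are hypothetical stand-ins, not real LocusLink or UniGene content and not part of any Bioconductor package. It assumes a two-column table in which a missing target is marked "-", as in the loc2acc output quoted below.

```r
# Toy stand-in for an accession-to-target mapping table (hypothetical data).
lookup <- data.frame(
  GB     = c("AA000001.1", "AA000002.2", "AA000003.1"),
  target = c("T1", "-", "T3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count how many query IDs receive a usable target.
coverage <- function(ids, table, missing = "-") {
  acc   <- sub("\\..*$", "", table$GB)    # drop the version suffix
  hit   <- table$target[match(ids, acc)]  # NA when the ID is not in the table
  found <- !is.na(hit) & hit != missing   # "-" counts as not found
  c(found = sum(found), total = length(ids), fraction = mean(found))
}

ids <- c("AA000001", "AA000002", "AA000003", "AA000004")
res <- coverage(ids, lookup)
print(res)

# Ratio of the two counts reported above (4587 via LocusLink, 16230 via SOURCE).
ratio <- 4587 / 16230
print(round(100 * ratio, 1))  # 28.3
```

Using match() keeps the result in the same order as the query IDs, so absent IDs show up as explicit misses rather than silently dropping out of the table.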

Gordon

At 11:24 PM 16/04/2004, John Zhang wrote:
> >I have a list of GenBank IDs for which I'd like the corresponding Unigene
> >cluster IDs. What is the easiest way to do this using Bioconductor
> >functions? (I've scanned annotate and AnnBuilder help and vignettes,
> >although way too quickly.)
> >
> >For the sake of being specific, here's a concrete example. What's Unigene
> >for GB="NM_004551"?
>
>Sorry for this delayed posting (I took one day off yesterday)
>
>I think the most direct way of getting the IDs mapped is to use the sources
>available at LocusLink (ftp://ftp.ncbi.nih.gov/refseq/LocusLink). If your
>target file contains GenBank accession numbers (e.g. "AC010642",
>"AC010642", ...), read ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc using
>read.table (sep = "\t") and then do a matching. If your target file contains
>RefSeq IDs (e.g. "NM_130786", "NM_000014", ...), read
>ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2ref instead. An example:
>
> > ids <- c("AC010642", "AF414429", "X56654", "Y08432")
> > ids2ll <- as.matrix(read.table(
>       "ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc",
>       header = FALSE, sep = "\t", strip.white = TRUE))
># We only need the second and third column
> > ids2ll <- ids2ll[, c(2, 3)]
> > colnames(ids2ll) <- c("GB", "LL")
># Drop the version number
> > ids2ll[,1] <- gsub("\\..*", "", ids2ll[,1])
> > mapped <- ids2ll[is.element(ids2ll[,1], ids),]
> > mapped
>       GB         LL
>1     "AC010642" "-"
>4     "AF414429" "15778556"
>10671 "X56654"   "30506"
>10677 "Y08432"   "-"
>
>
>
> >
> >Thanks a lot
> >Gordon
> >
>
>Jianhua Zhang
>Department of Biostatistics
>Dana-Farber Cancer Institute
>44 Binney Street
>Boston, MA 02115-6084