[BioC] Genbank to Unigene IDs

John Zhang jzhang at jimmy.harvard.edu
Fri Apr 16 15:24:03 CEST 2004

>I have a list of GenBank IDs for which I'd like the corresponding Unigene 
>cluster IDs. What is the easiest way to do this using Bioconductor 
>functions? (I've scanned annotate and AnnBuilder help and vignettes, 
>although way too quickly.)
>For the sake of being specific, here's a concrete example. What's Unigene 
>for GB="NM_004551"?

Sorry for the delayed reply (I took a day off yesterday).

I think the most direct way of getting the ids mapped is to use the source files
available from LocusLink (ftp://ftp.ncbi.nih.gov/refseq/LocusLink). If your target
file contains GenBank accession numbers (e.g. "AC010642", "AF414429", ...), read
ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc using read.table (sep = "\t")
and then match your ids against it. If your target file contains RefSeq ids (e.g.
"NM_130786", "NM_000014", ...), read
ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2ref instead. An example:

> ids <- c("AC010642", "AF414429", "X56654", "Y08432")
> ids2ll <- as.matrix(read.table(
+     "ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc",
+     header = FALSE, sep = "\t", strip.white = TRUE))
# We only need the second and third columns
> ids2ll <- ids2ll[, c(2, 3)]
> colnames(ids2ll) <- c("GB", "LL")
# Drop the version number
> ids2ll[,1] <- gsub("\\..*", "", ids2ll[,1])
> mapped <- ids2ll[is.element(ids2ll[,1], ids),]
> mapped 
      GB         LL        
1     "AC010642" "-"       
4     "AF414429" "15778556"
10671 "X56654"   "30506"   
10677 "Y08432"   "-"
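
The is.element() filter above returns rows in file order and silently drops
ids that are not in the file. If you would rather get exactly one result per
query id, in query order, something like the following might work. It is only
a sketch: map_ids is a made-up helper name, the small table stands in for the
two columns kept from loc2acc (rows copied from the output above, with a ".1"
version suffix added to one accession purely for illustration), and it assumes
that "-" means no value is available.

```r
# Toy stand-in for the two columns kept from loc2acc (not the real file).
ids2ll <- data.frame(
    GB = c("AC010642", "AF414429.1", "X56654", "Y08432"),
    LL = c("-", "15778556", "30506", "-"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: return one value per query id, in query order.
map_ids <- function(ids, table) {
    # Drop a trailing version number such as ".1" before comparing
    acc <- sub("\\.[0-9]+$", "", table$GB)
    # match() gives, for each query id, the row of its first occurrence
    hit <- table$LL[match(ids, acc)]
    # Assumption: "-" is a placeholder for "no value", so report it as NA
    hit[!is.na(hit) & hit == "-"] <- NA
    setNames(hit, ids)
}

ids <- c("X56654", "AF414429", "NM_004551", "AC010642")
mapped <- map_ids(ids, ids2ll)
print(mapped)
```

Ids that are absent from the table (here "NM_004551", which is a RefSeq id and
would need loc2ref instead) come back as NA rather than disappearing, which
makes it easier to see what still needs to be looked up elsewhere.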

>Thanks a lot

Jianhua Zhang
Department of Biostatistics
Dana-Farber Cancer Institute
44 Binney Street
Boston, MA 02115-6084
