[BioC] Genbank to Unigene IDs

Fri Apr 16 02:16:42 CEST 2004

Dave,

If all you want is GB -> UG mappings, then you can use the UG class in
AnnBuilder. As usual, you need a two-column text file with some sort of
identifier in the first column and the accession numbers in the second.
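
For instance, a minimal sketch of writing such a file from base R. The identifier "probe1" and the temporary location are placeholders; NM_004551 is the accession from the original question, and the tab separator follows the tab-separated layout described further down the thread.

```r
# Two-column, tab-separated base file: identifier first, GenBank accession second.
# "probe1" is a placeholder identifier; NM_004551 comes from the original question.
base <- data.frame(id = "probe1", gb = "NM_004551", stringsAsFactors = FALSE)

base_file <- file.path(tempdir(), "myBase")  # hypothetical location
write.table(base, file = base_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read it back to confirm the layout.
check <- read.delim(base_file, header = FALSE, stringsAsFactors = FALSE)
```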

You then have to open the Perl script gbUGparser, which is in
rw1090\library\annbuilder\scripts, and change the two instances of
LOCUSLINK to ID. Save the result under a new name, something like
gbUGparserTest (remove the .txt extension if WordPad/Notepad adds one).
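
If you would rather not edit by hand, the same change can be sketched from R. The helper name copy_with_replacement is made up for illustration, and the demonstration runs on a stand-in file with placeholder lines, since the real contents of gbUGparser are not shown here. It replaces every literal occurrence, so check afterwards that only the two intended instances changed.

```r
# Copy a text file, replacing every literal occurrence of `from` with `to`.
copy_with_replacement <- function(src, dest, from, to) {
  lines <- readLines(src, warn = FALSE)
  writeLines(gsub(from, to, lines, fixed = TRUE), dest)
  invisible(dest)
}

# Demonstration on a stand-in file; the two lines below are placeholders,
# not the real contents of gbUGparser.
src  <- tempfile("gbUGparser")
dest <- paste0(src, "Test")
writeLines(c("placeholder line with LOCUSLINK", "another LOCUSLINK line"), src)
copy_with_replacement(src, dest, from = "LOCUSLINK", to = "ID")
edited <- readLines(dest)
```

Writing the copy from R also avoids the editor adding a .txt extension.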

Back in R, try this:

tst <- UG(srcUrl=getSrcUrl("UG"),
          parser="C:/r/rw1090/library/annbuilder/scripts/gbUGparserTest",
          baseFile="myBase")

Data <- parseData(tst)
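
What parseData() returns is not shown in this thread. As a sketch only, assuming it is a two-column character matrix with the base-file identifier in the first column and the UniGene ID in the second (check with str(Data) first), the accessions could be joined back on as below. The matrix is a stand-in with a placeholder UniGene ID, not real output.

```r
# Stand-in for the parsed result: identifier, UniGene ID (placeholder value).
Data <- matrix(c("probe1", "Hs.PLACEHOLDER"), ncol = 2,
               dimnames = list(NULL, c("id", "ug")))

# The base file contents: identifier, GenBank accession.
base <- data.frame(id = "probe1", gb = "NM_004551", stringsAsFactors = FALSE)

# Join on the identifier to get a GenBank -> UniGene table.
gb2ug <- merge(base, as.data.frame(Data, stringsAsFactors = FALSE), by = "id")
hit <- gb2ug[gb2ug$gb == "NM_004551", c("gb", "ug")]
```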

This will only download the Hs.data.gz file, so should be much quicker
than using ABPkgBuilder(). Also, if you look in the code for
ABPkgBuilder, there is a much more elegant way to get the path to the
perl script. However, when I am just trying to get stuff to work, I find
that ugly and straightforward are the way to go :).
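
One way to avoid hard-coding the path, offered as a sketch and not necessarily what ABPkgBuilder itself does, is base R's system.file(), which returns "" when the file or package cannot be found. The helper name find_pkg_file is made up; the scripts/gbUGparser location is taken from the path given above.

```r
# Locate a file inside an installed package, failing loudly if it is absent.
find_pkg_file <- function(pkg, ...) {
  path <- system.file(..., package = pkg)
  if (!nzchar(path)) stop("not found in package '", pkg, "': ", file.path(...))
  path
}

# With AnnBuilder installed this would be:
#   parser <- find_pkg_file("AnnBuilder", "scripts", "gbUGparser")
# Runnable demonstration with a file every R installation has:
desc <- find_pkg_file("base", "DESCRIPTION")
```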

HTH,

Jim

James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
>>> rossini at blindglobe.net 04/15/04 5:31 PM >>>

Dave -

Sorry to have led you on a wild goose chase.  We've been much more
successful on Linux builds; one solution was to have pre-downloaded
files, but I can't seem to quickly find our mini-script that did that
(it removed the download hassle, especially if you are working with
genes which probably aren't changing).
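
The pre-download idea can be sketched in a few lines of base R: fetch the file once and reuse the local copy on later runs. This is not the mini-script mentioned above, which is not shown in the thread; cached_download and its fetch argument are hypothetical names, and how to point AnnBuilder at the local copy is not covered here.

```r
# Download `url` to `dest` only if `dest` does not exist yet.
# `fetch` is injectable so the logic can be exercised without a network.
cached_download <- function(url, dest, fetch = download.file) {
  if (!file.exists(dest)) {
    fetch(url, dest, mode = "wb")  # binary mode matters for .gz files on Windows
  }
  dest
}

# Offline demonstration with a fake fetcher that counts its calls.
calls <- 0
fake_fetch <- function(url, destfile, ...) {
  calls <<- calls + 1
  writeLines("stand-in contents", destfile)
}
dest <- tempfile(fileext = ".gz")
cached_download("ftp://example.invalid/Hs.data.gz", dest, fetch = fake_fetch)
cached_download("ftp://example.invalid/Hs.data.gz", dest, fetch = fake_fetch)

# Real use (needs network access), with the URL reported below for UniGene:
#   cached_download("ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz",
#                   "Hs.data.gz")
```

The second call finds the file already present and skips the fetch.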

"Dave Waddell" <dwaddell at nutecsciences.com> writes:

> The output from:
> mySrcUrl <- getSrcUrl("UG")
> is
>> mySrcUrl
> [1] "ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz"
> this is rejected by ABPkgBuilder:
> "Error in toupper(x) : non-character argument to toupper()"
>
> when getSrcUrl has the ALL argument it gives:
> mySrcUrls <- getSrcUrl(src = "ALL",organism  = "human")
>> mySrcUrls
> LL
> "ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz"
> GP
> "http://www.genome.ucsc.edu/goldenPath/hg16/database/"
> UG
> "ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz"
> GO
> "http://www.godatabase.org/dev/database/archive/2004-03-01/go_200403-termdb.xml.gz"
> KEGG
> "ftp://ftp.genome.ad.jp/pub/kegg/pathways"
> YG
> "http://www.yeastgenome.org/DownloadContents.shtml"
> HG
> "ftp://ftp.ncbi.nih.gov/pub/HomoloGene/hmlg.ftp"
> So I thought I might cheat and use:
> mySrcUrl <- mySrcUrls[3]
>> mySrcUrls[3]
>                                                     UG 
> "ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz"
>
> As you can see this gets rejected as well:
> Error in loadFromUrl(srcUrl(object), dist) : 
>         URL NA is incorrect or the target site is not responding!
> Error in unifyMappings(base, ll, ug, otherSrc, fromWeb) : 
>         Failed to get or parse UniGene data becaus of:
>
>  Error in loadFromUrl(srcUrl(object), dist) : 
>         URL NA is incorrect or the target site is not responding!
> Is it possible to use Annotation that was created on Linux in the
> Windows environment? If so, does anyone want to donate it?
> Thanks, Dave.
>
>
> -----Original Message-----
> From: James MacDonald [mailto:jmacdon at med.umich.edu] 
> Sent: Thursday, April 15, 2004 9:52 AM
> To: dwaddell at nutecsciences.com; bioconductor at stat.math.ethz.ch
> Subject: RE: [BioC] Genbank to Unigene IDs
>
> You probably need to update your AnnBuilder. A recent version was using
> the system temp directory instead of the AnnBuilder temp directory,
> which didn't work well on Win32. AFAIK, the current devel version of
> AnnBuilder has been rolled back to use the AnnBuilder temp dir.
>
> As an aside, if all you need is GB -> UG mappings, it is probably
> overkill to use ABPkgBuilder in this way, which is going to parse
> LocusLink and KEGG also (which takes some time). There are two
> alternatives that I can think of (both untested by me). First, use
> ABPkgBuilder, but only parse UG by changing the srcUrl to:
>
> mySrcUrl <- getSrcUrl("UG")
>
> Another possibility is to use the UG class directly. See ?UG.
>
> Best,
>
> Jim
>
>
>>>> "Dave Waddell" <dwaddell at nutecsciences.com> 04/15/04 10:37AM >>>
> I tried running this but got an error:
>> library(AnnBuilder)
>> myBaseType <- "gb"
>> myDir <- "C:/Temp"
>> myBase <- "C:/Temp/tempFile.txt"
>> mySrcUrls <- getSrcUrl(src = "ALL",organism  = "human")
>> ABPkgBuilder(baseName = myBase, srcUrls = mySrcUrls, baseMapType =
> +      myBaseType, pkgName = "Hum_Agi1A", pkgPath = myDir,organism =
> +      "human",  version = "1.0",
> +      makeXML = TRUE, author = list(author = "dpritch", maintainer =
> +      "dpritch at u.washington.edu"), fromWeb = TRUE)
> [1] "It may take me a while to process the data. Be patient!"
> Warning message: 
> cannot open file `C:/R/rw1090beta/library/AnnBuilder/temp/tempOut31783'
>
> Error in unifyMappings(base, ll, ug, otherSrc, fromWeb) : 
>         Failed to get or parse LocusLink data because of:
>
>  Error in file(file, "r") : unable to open connection
>
> I had changed this directory from "Read Only" and checked that I had
> write permissions from within R:
>> setwd("C:/R/rw1090beta/library/AnnBuilder/temp")
>> dir()
> [1] "file24842Tgo.xml" "README"          
>> write("Hello")
>> dir()
> [1] "data"             "file24842Tgo.xml" "README"
>
> I get the same error if I run 
> example("ABPkgBuilder")
>
> Any suggestions?
>
> Dave.
> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch 
> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of A.J. Rossini
> Sent: Thursday, April 15, 2004 8:48 AM
> To: Gordon Smyth
> Cc: BioC Mailing List
> Subject: Re: [BioC] Genbank to Unigene IDs
>
> Gordon Smyth <smyth at wehi.edu.au> writes:
>
>> I have a list of GenBank IDs for which I'd like the corresponding
>> Unigene cluster IDs. What is the easiest way to do this using
>> Bioconductor functions? (I've scanned annotate and AnnBuilder help and
>> vignettes, although way too quickly.)
>>
>> For the sake of being specific, here's a concrete example. What's
>> Unigene for GB="NM_004551"?
>
> Here's what I'd do (more of a chip-style analysis than instant
> WWW-based gratification, which might also be possible):
>
> 1. First create a tab-separated 2-column file: first column dummy
> probe IDs (could be real or not), second column GB IDs.  So, you'd have
> 1 row in a file called "Dummy.tsv":
>
>
>
> 1    NM_004551
>
>
>
>
> 2.  Have a script similar to:
>
>
>
> library(AnnBuilder)
> myBaseType <- "gb"
> # myDir maps the directory where you want the data package built ---
> # obviously this should be changed for the directory structure on the
> # linux box
> myDir <- "C:/DavidsData/Annotation_Folders"
>
> # myBase maps the file that contains the mapping of Agilent feature
> # numbers to GenBank ID's
> myBase <- "C:/DavidsData/Annotation_Folders/Dummy.tsv"
>
> #use AnnBuilder internal lists of data sources
> mySrcUrls <- getSrcUrl(src = "ALL",organism  = "human")
>
> #invoke ABPkgBuilder
> ABPkgBuilder(baseName = myBase, srcUrls = mySrcUrls,
>              baseMapType = myBaseType, pkgName = "Hum_Agi1A",
>              pkgPath = myDir, organism = "human", version = "1.0",
>              makeXML = TRUE,
>              author = list(author = "dpritch",
>                            maintainer = "dpritch at u.washington.edu"),
>              fromWeb = TRUE)
>
> 3. install the package environment
>
> 4. use it to find the IDs (can verify the ID mapping with the XML
> output file, as well)
>
> best,
> -tony
>
> -- 
> rossini at u.washington.edu           
> http://www.analytics.washington.edu/ 
> Biomedical and Health Informatics   University of Washington
> Biostatistics, SCHARP/HVTN          Fred Hutchinson Cancer Research Center
> UW (Tu/Th/F): 206-616-7630 FAX=206-543-3461 | Voicemail is unreliable
> FHCRC  (M/W): 206-667-7025 FAX=206-667-4812 | use Email
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
>
