[BioC] libraries or commands to help with parsing or handling web based database queries

Thomas Girke thomas.girke at ucr.edu
Mon Feb 19 19:33:42 CET 2007


Alan,
For this you will need some basic knowledge of how to use regular
expressions with R's grep() and gsub() functions. Additional useful
functions are paste() and Sys.sleep().
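
As a minimal sketch of how these functions fit together, the snippet
below works on a made-up character vector instead of a live page. The
contents of 'page' and the example.org host are assumptions for
illustration only, not real server output.

```r
# Hypothetical lines standing in for a downloaded page (not real server output)
page <- c("<html>", "Theoretical pI/Mw: 5.57 / 2107.05", "</html>")

# grep() returns the indices of the lines that match a pattern
hit <- grep("Theoretical pI/Mw:", page)

# gsub() deletes everything up to and including "/ ", leaving the last number
mw <- as.numeric(gsub(".*/ ", "", page[hit]))

# paste() assembles a query URL from fixed and variable parts
# (example.org is a placeholder host)
myurl <- paste("http://example.org/tool?protein=", "MWVTFISLL", sep = "")

print(hit)
print(mw)
print(myurl)
```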

RCurl also provides some useful utilities for this approach.

Below is a short example of a similar problem: obtaining peptide
MW information from the Expasy site (http://ca.expasy.org/tools/pi_tool.html).


###################################################################
myentries <- c("MKWVTFISLLFLFSSAYS", "MWVTFISLL", "MFISLLFLFSSAYS")
myresult <- NULL
for(i in myentries) {
	myurl <- paste("http://ca.expasy.org/cgi-bin/pi_tool?protein=", 
			i, "&resolution=monoisotopic", sep="")
	x <- url(myurl)
	res <- readLines(x)
	close(x)
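	mylines <- res[grep('Theoretical pI/Mw:',res)] # keep only the result line
	mw <- as.numeric(gsub('.*/ ','', mylines)) # text after "/ " is the Mw
	# store NA when no single result line is found, so lengths stay aligned
	myresult <- c(myresult, if(length(mw) == 1) mw else NA)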
	print(myresult)
	Sys.sleep(1) # halts process for one sec to give database a break
}
final <- data.frame(Pep=myentries, MW=myresult)
cat("\n The MW values for my peptides are:\n")
print(final)
###################################################################
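
For the question about selecting all of the text between two strings,
a minimal sketch might look like the following. It assumes the page
source has already been collapsed into one string, for example with
paste(res, collapse=""), and it uses a made-up 'src' string, not real
METLIN output. The names 'src', 'block' and 'cells' are hypothetical.

```r
# Hypothetical page source collapsed into one string (not real METLIN output)
src <- "<form>search</form><tr><td>112.05</td><td>compound_A</td></tr></table><p>footer</p>"

# Keep only the text between the first "</form>" and the following "</table>"
block <- sub(".*?</form>(.*?)</table>.*", "\\1", src, perl = TRUE)

# Split the block at the cell ends, then strip the remaining tags
cells <- unlist(strsplit(block, "</td>"))
cells <- gsub("<[^>]+>", "", cells)
cells <- cells[cells != ""]  # drop the empty leftover from "</tr>"
print(cells)
```

A real page will need its own patterns, and for anything more complex
than a simple table a proper HTML parser is likely more robust than
regular expressions.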


Thomas


On Mon 02/19/07 11:41, ALAN SMITH wrote:
> Hello Bioconductors
> I am having a very hard time figuring out how to make web based
> database query results into a nice neat table (if such a thing is
> possible in R).  I am constantly searching the metabolite database
> METLIN by copying and pasting addresses.  I have to search this
> database with several hundred entries, often, and would like to
> automate the process to remove the HUGE amount of time I spend doing
> this carpal tunnel creating routine.  I have found several ways to get
> the page's source, like:
> 
> library(RCurl)
> test<-getURL("http://metlin.scripps.edu/metabo_list.php?mass_min=112.04885&mass_max=112.0555")
> #OR
> url.show("http://metlin.scripps.edu/metabo_list.php?mass_min=112.04885&mass_max=112.0555")
> 
> Once I get the URL info I notice that the data I am interested in is
> between  </form>  and  </table>.
> 
> Are there any packages or methods in R to extract the information I am
> interested in?  I am having problems manipulating STRINGS in R like
> selecting all of the text between two strings.  I am not a programmer.
> 
> Thanks,
> Alan
> 
> Note I am able to use KEGGSOAP without any trouble.

-- 
Thomas Girke, Ph.D.
1008 Noel T. Keen Hall
Center for Plant Cell Biology (CEPCEB)
University of California
Riverside, CA 92521

E-mail: thomas.girke at ucr.edu
Website: http://faculty.ucr.edu/~tgirke
Ph: 951-827-2469
Fax: 951-827-4437


