[BioC] libraries or commands to help with parsing or handling web based database queries

Sean Davis sdavis2 at mail.nih.gov
Tue Feb 20 14:43:39 CET 2007


Alan,

Have you looked at using the XML package?  Depending on how malformed the HTML 
is, it may be useful, as it is designed to parse these types of data.
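
As a minimal sketch of what I mean, assuming the results arrive as an ordinary
HTML table: the page below is a made-up stand-in (the column names are the ones
Benjamin mentions, the rows are placeholders and not real METLIN records), and
the real page layout may well differ.

```r
library(XML)

# Hypothetical stand-in for a downloaded results page; not real METLIN output.
html <- '<html><body><form>search form</form>
<table>
<tr><th>MID</th><th>MASS</th><th>Name</th><th>Formula</th></tr>
<tr><td>1</td><td>112.05</td><td>compound_A</td><td>formula_A</td></tr>
<tr><td>2</td><td>112.06</td><td>compound_B</td><td>formula_B</td></tr>
</table></body></html>'

# Parse the (possibly untidy) HTML into a document tree
doc <- htmlParse(html, asText = TRUE)

# Take every table row that has data cells and read the cell text
rows  <- getNodeSet(doc, "//table//tr[td]")
cells <- lapply(rows, function(r) xpathSApply(r, "./td", xmlValue))

# Stack the rows into a data frame and label it with the header cells
tab <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(tab) <- xpathSApply(doc, "//table//tr/th", xmlValue)
tab$MASS <- as.numeric(tab$MASS)
print(tab)
```

With a real page, the string returned by getURL() could be passed to
htmlParse() in place of the toy string; the XPath expressions would probably
need adjusting to the actual markup.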

Sean


On Tuesday 20 February 2007 08:07, Benjamin Otto wrote:
> Hi Alan,
>
> Which parts are you interested in exactly?
> Looking at the page, the MID, MASS, Name and Formula fields seem fairly
> easy to extract from the source. The structure, however, seems a little
> trickier to me.
>
> Regards
>
> Benjamin
>
> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch
> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Thomas Girke
> Sent: 19 February 2007 19:34
> To: ALAN SMITH
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] libraries or commands to help with parsing or
> handling web based database queries
>
> Alan,
> For this you will need some basic knowledge of how to use regular
> expressions with R's grep() and gsub() functions. Other useful
> functions are paste() and Sys.sleep().
>
> RCurl also provides some useful utilities for this approach.
>
> Below is a short example for a similar problem: obtaining peptide MW
> information from the ExPASy site (http://ca.expasy.org/tools/pi_tool.html).
>
>
> ###################################################################
> myentries <- c("MKWVTFISLLFLFSSAYS", "MWVTFISLL", "MFISLLFLFSSAYS")
> myresult <- NULL
> for(i in myentries) {
> 	# build the query URL for this peptide
> 	myurl <- paste("http://ca.expasy.org/cgi-bin/pi_tool?protein=",
> 			i, "&resolution=monoisotopic", sep="")
> 	# download the result page as a vector of lines
> 	x <- url(myurl)
> 	res <- readLines(x)
> 	close(x)
> 	# keep the line reporting pI/Mw and strip everything up to "/ "
> 	mylines <- res[grep('Theoretical pI/Mw:',res)]
> 	myresult <- c(myresult, as.numeric(gsub('.*/ ','', mylines)))
> 	print(myresult)
> 	Sys.sleep(1) # pause for one second to give the database a break
> }
> # combine the peptides and their MW values into one table
> final <- data.frame(Pep=myentries, MW=myresult)
> cat("\n The MW values for my peptides are:\n")
> print(final)
> ###################################################################
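
The same paste()-style URL building carries over to Alan's METLIN query. A
rough sketch, assuming the query format shown in Alan's getURL() call;
metlin_urls() and the tolerance value are illustrative names and numbers, not
anything from METLIN, and nothing is downloaded here.

```r
# Hypothetical helper: build one METLIN query URL per target mass.
# 'tol' is an assumed half-width of the mass window, in the same units as 'mass'.
metlin_urls <- function(mass, tol = 0.01,
                        base = "http://metlin.scripps.edu/metabo_list.php") {
  # sprintf() is vectorised, so a vector of masses gives a vector of URLs
  sprintf("%s?mass_min=%.5f&mass_max=%.5f", base, mass - tol, mass + tol)
}

urls <- metlin_urls(c(112.05, 180.06))
print(urls)
```

Each URL could then be fetched inside a loop like the one above, with a
Sys.sleep() call between requests.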
>
>
> Thomas
>
> On Mon 02/19/07 11:41, ALAN SMITH wrote:
> > Hello Bioconductors
> > I am having a very hard time figuring out how to make web based
> > database query results into a nice neat table (if such a thing is
> > possible in R).  I am constantly searching the metabolite database
> > METLIN by copying and pasting addresses.  I often have to search this
> > database with several hundred entries, and would like to automate
> > the process to avoid the HUGE amount of time I spend on this
> > carpal-tunnel-inducing routine.  I have found several ways to get
> > the page's source, such as:
> >
> > library(RCurl)
> > test <- getURL("http://metlin.scripps.edu/metabo_list.php?mass_min=112.04885&mass_max=112.0555")
> >
> > #OR
> >
> >
> >
> > Once I get the URL info I notice that the data I am interested in is
> > between  </form>  and  </table>.
> >
> > Are there any packages or methods in R to extract the information I am
> > interested in?  I am having problems manipulating STRINGS in R, such as
> > selecting all of the text between two strings.  I am not a programmer.
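
For the "text between two strings" part, base R may be enough. A small sketch,
assuming the markers </form> and </table> both occur in the page;
text_between() is a made-up helper and the page string is a toy example, not
real METLIN output.

```r
# Hypothetical helper: return the text between the first 'left' marker
# and the first 'right' marker that follows it (NA if either is missing).
text_between <- function(x, left, right) {
  x <- paste(x, collapse = "\n")            # also accepts readLines() output
  s <- regexpr(left, x, fixed = TRUE)
  if (s[1] == -1) return(NA_character_)
  rest <- substring(x, s[1] + attr(s, "match.length"))
  e <- regexpr(right, rest, fixed = TRUE)
  if (e[1] == -1) return(NA_character_)
  substr(rest, 1, e[1] - 1)
}

page  <- "<form>search</form><tr><td>1</td><td>112.05</td></tr></table><p>footer</p>"
chunk <- text_between(page, "</form>", "</table>")

# Replace the remaining tags with spaces, then split into cell values
cells <- strsplit(gsub("<[^>]+>", " ", chunk), " +")[[1]]
cells <- cells[cells != ""]
print(cells)
```

Splitting on spaces only works while the cell values themselves contain no
spaces; for anything messier a real HTML parser is likely the safer route.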
> >
> > Thanks,
> > Alan
> >
> > Note I am able to use KEGGSOAP without any trouble.


