[R] R: Is there a way to extract some fields data from HTML pages through any R function ?

Martin Morgan mtmorgan at fhcrc.org
Sun Jul 5 21:42:35 CEST 2009


mauede at alice.it wrote:
> I tried to apply the scheme you suggested to open the web page on
> "http://mirecords.umn.edu/miRecords/index.php" and got the following:
> 
>> result <- postForm("http://mirecords.umn.edu/miRecords/index.php",
> + searchType="miRNA", species="Homo sapiens",
> +  searchBox="hsa-let-7a", submitButton="Search")

What we are doing here is sometimes called 'screen scraping' -- figuring
out how to extract information from a web page when the information is
not presented in an alternative, more reliable, form. I offered this
route as a response to your specific question, how to extract some
fields from an HTML page, but maybe there is a better way that is
specific to the resources and information you are trying to extract. For
instance, I see on the web page above that there is a link 'Download
validated targets' that leads to an Excel-style spreadsheet. Maybe that
is a better route for this resource? I don't know.

In terms of the problem you are encountering above, the fields
searchType, species, searchBox, and submitButton were all defined on the
web page of the resource you mentioned in a previous email (miRDB); here
you must look at the 'source' (e.g., right-click 'View Page Source' in
Firefox) of the web page you are trying to scrape, and figure out the
appropriate fields. This requires some familiarity with HTML and HTML
forms, so that you can recognize what you are looking for. I think on
this particular page you are likely to run into additional
difficulties, because selection of a 'species' populates the 'mirna_acc'
field with allowable values that combine the miRNA name with the number
of validated targets that will be returned -- you almost need to know
the answer before you can programmatically extract the data.
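
As a rough illustration of the 'view source' step, the form fields can
also be listed from within R. This is only a sketch: the HTML string
below is a made-up stand-in for a real page, and the form and field
names in it are hypothetical; a real page would first have to be
fetched, e.g. with getURL() from RCurl.

```r
library(XML)

## Made-up stand-in for a downloaded page; the form and its field names
## are hypothetical, not copied from miRecords or miRDB.
page <- '<html><body>
  <form action="search.cgi" method="post">
    <select name="species"><option value="Human">Human</option>
                           <option value="Mouse">Mouse</option></select>
    <input type="text" name="searchBox">
    <input type="submit" name="submitButton" value="Go">
  </form>
</body></html>'

doc <- htmlParse(page, asText=TRUE)

## names of all input and select elements inside any form
fields <- xpathSApply(doc, "//form//input | //form//select",
                      xmlGetAttr, "name")
## allowable values offered by the 'species' drop-down
species <- xpathSApply(doc, "//select[@name='species']/option",
                       xmlGetAttr, "value")
```

The names in 'fields' are the candidates for the named arguments of
postForm(); the values in 'species' show what the form would accept for
that field.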

>>  html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)
> Unexpected end tag : a
> error parsing attribute name
> Opening and ending tag mismatch: strong and font
> htmlParseStartTag: invalid element name
> Unexpected end tag : a

htmlTreeParse is very forgiving of malformed HTML; these messages are
its way of telling you that it has parsed the document, even though the
markup was incorrect.
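
A small sketch of what 'forgiving' means in practice, using a
deliberately broken, made-up fragment rather than the miRecords page:
the parser may report problems, yet a usable document comes back.
Whether and how the messages are displayed may depend on the version of
the XML package.

```r
library(XML)

## Deliberately malformed: a stray </a>, and <b> is never closed
bad <- '<html><body><p>hsa-let-7a</a> <b>validated</p></body></html>'

doc <- htmlTreeParse(bad, asText=TRUE, useInternalNodes=TRUE)

## the document was still built and can be queried with XPath
isDoc <- inherits(doc, "XMLInternalDocument")
txt <- xpathSApply(doc, "//p", xmlValue)
```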

>>  html <- htmlTreeParse(result, asText=FALSE, useInternalNodes=TRUE)

There are too many parameters involved to try changing them arbitrarily;
you must take it upon yourself to understand the functions and the
correct way to use them.
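
To make the difference between the two calls concrete: with asText=TRUE
the first argument is taken to be the HTML itself, whereas with
asText=FALSE it is taken to be the name of a file (or a URL), which is
why the second call complains about a 'File'. A minimal sketch with a
made-up page (the page content and the temporary file are assumptions
for illustration only):

```r
library(XML)

page <- "<html><head><title>miRecords</title></head><body></body></html>"

## asText=TRUE: the string is the document
fromText <- htmlTreeParse(page, asText=TRUE, useInternalNodes=TRUE)

## asText=FALSE: the string names a file, so write the page to one first
f <- tempfile(fileext=".html")
writeLines(page, f)
fromFile <- htmlTreeParse(f, asText=FALSE, useInternalNodes=TRUE)

titleText <- xpathSApply(fromText, "//title", xmlValue)
titleFile <- xpathSApply(fromFile, "//title", xmlValue)

## passing the content itself with asText=FALSE is expected to fail,
## because no file of that name exists; keep the message for inspection
res <- tryCatch(htmlTreeParse(page, asText=FALSE, useInternalNodes=TRUE),
                error=function(e) conditionMessage(e))
```

Since postForm() returns the page content as a string, asText=TRUE is
the setting that matches the earlier example.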

Hoping this helps,

Martin

> Error in htmlTreeParse(result, asText = FALSE, useInternalNodes = TRUE) :
>   File <html><!-- InstanceBegin template="/Templates/admin.dwt"
> codeOutsideHTMLIsLocked="false" -->
> 
> <head>
> 
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
> 
> <link href="style/link.css" rel="stylesheet" type="text/css">
> 
> <!-- InstanceParam name="nav_1" type="boolean" value="true" -->
> 
> <title>miRecords</title>
> 
> </head>
> 
> <body bgcolor="#FFFFFF" leftmargin="0" topmargin="0"  marginwidth="0"
> marginheight="0">
> 
> 
> 
> 
> 
> 
> 
> 
> 
> <table width="80" border="0" cellspacing="0" cellpadding="0">
> 
>   <tr>
> 
>     <td colspan="3"><img src="images/title.jpg" alt="" width=900
> height=79 border="0"></a></td>
> 
>   </tr>
> 
>   <tr>
> 
>     <td width="131" valign="bottom" bgcolor="#CCCCCC"menu""></td>
> 
>     <td width="769" align="right" valign="middle" bgcolor="#CCCCCC"><a
> href="redirect.php?s=l" class="menu">Validated Targets </a>  | <a
> href="redirect.php?s=p" class="menu">Predicted Targets </a> | <a
> href="download.php"  class="menu">Download Validated Targets </a> | <a
> href="submit.php" class="m
>>
> 
> 
> 
> I am lost about how to proceed from the above.
> My goal is always to get the VALIDATED miRNA identifier and string,
> followed by its target gene's 3'UTR sequence.
> 
> Thank you in advance,
> Maura
> 
> P.S. BioMart has been working fine since yesterday.
> 
> -----Original message-----
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
> Sent: Wed 01/07/2009 17.51
> To: mauede at alice.it
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Is there a way to extract some fields data from HTML
> pages through any R function ?
> 
> Hi Maura --
> 
> mauede at alice.it wrote:
>> I deal with a huge amount of biology data stored in different databases.
>> The databases belonging to the Bioconductor project can be accessed
>> through Bioconductor packages.
>> Unfortunately, some useful data is stored in databases such as miRDB
>> and miRecords, which offer only an interactive HTML interface. See for
>> instance
>>  http://mirdb.org/cgi-bin/search.cgi,
>>  http://mirecords.umn.edu/miRecords/interactions.php?species=Homo+sapiens&mirna_acc=Any&targetgene_type=refseq_acc&targetgene_info=&v=yes&search_int=Search
>>
>> Downloading data manually from the web pages is a painstaking,
>> time-consuming and error-prone activity.
>> I came across a Python script that downloads (dumps) whole web pages
>> into a text file that is then parsed.
>> This is possible because Python has a library to access web pages.
>> But I have no experience with Python programming, nor do I like a
>> programming language whose syntax is indentation-sensitive.
>>
>> I am *hoping* that there exists some sort of web page / HTML
>> connection from R ... is there?
> 
> Tools in R for this are the RCurl package and the XML package.
> 
>   library(RCurl)
>   library(XML)
> 
> Typically this involves manual exploration of the web form. Then you
> might query the web form
> 
>   result <- postForm("http://mirdb.org/cgi-bin/search.cgi",
>                      searchType="miRNA", species="Human",
>                      searchBox="hsa-let-7a", submitButton="Go")
> 
> and parse the results into a convenient structure
> 
>   html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)
> 
> you can then use XPath (http://www.w3.org/TR/xpath, especially section
> 2.5) to explore and extract information, e.g.,
> 
>   ## second table, first row
>   getNodeSet(html, "//table[2]/tr[1]")
>   ## second table, makes subsequent paths shorter
>   tbl <- getNodeSet(html, "//table[2]")[[1]]
>   xget <- function(xml, path) # a helper function
>       unlist(xpathApply(xml, path, xmlValue))[-1]
>   df <- data.frame(TargetRank=as.numeric(xget(tbl, "./tr/td[2]")),
>                    TargetScore=as.numeric(xget(tbl, "./tr/td[3]")),
>                    miRNAName=xget(tbl, "./tr/td[4]"),
>                    GeneSymbol=xget(tbl, "./tr/td[5]"),
>                    GeneDescription=xget(tbl, "./tr/td[6]"))
> 
> There are many ways through this latter part, probably some much cleaner
> than presented above. There are fairly extensive examples on each of the
> relevant help pages, e.g., ?postForm.
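
One of the possibly cleaner routes is readHTMLTable() from the XML
package, which converts an HTML table to a data frame in one step. The
sketch below uses a made-up table (hypothetical column names and
values), not output from miRDB; on a real page the table index passed
to 'which' would have to be found by inspection.

```r
library(XML)

## Made-up page with a single table; names and values are hypothetical
page <- '<html><body><table>
  <tr><th>Rank</th><th>Score</th><th>Gene</th></tr>
  <tr><td>1</td><td>99</td><td>GENE1</td></tr>
  <tr><td>2</td><td>97</td><td>GENE2</td></tr>
</table></body></html>'

doc <- htmlParse(page, asText=TRUE)

## first table on the page; its first row supplies the column names
df <- readHTMLTable(doc, which=1, header=TRUE, stringsAsFactors=FALSE)

## convert the numeric columns explicitly rather than relying on
## whatever type the reader assigned
df$Rank <- as.numeric(df$Rank)
df$Score <- as.numeric(df$Score)
```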
> 
> Martin
> 
> 
>> Thank you very much for any suggestion.
>> Maura



