[R] [Possible SPAM] Reading selected lines in an .html file

Martin Morgan mtmorgan at fhcrc.org
Thu Jun 5 23:07:52 CEST 2008


Staying in R, the XML package in conjunction with the XPath query
language is likely to be your friend.

> library(XML)
> html <- htmlTreeParse("http://www.wunderground.com/global/stations/16239.html", useInternalNodes=TRUE)
> xpathApply(html, "//span[@pwsvariable='tempf' and
+    @pwsid='LIRA']/@value", xmlValue)
[[1]]
[1] "63"

See http://www.w3.org/TR/xpath, especially
http://www.w3.org/TR/xpath#path-abbrev, for XPath hints.
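
For trying the query without hitting the live page, here is a minimal
sketch that runs the same kind of XPath expression against an inline
fragment. The fragment is made up for illustration: the 'tempf' and
'LIRA' attribute values come from the query above, but the 'humidity'
value of pwsvariable and the numbers are assumptions, and the real page
may be laid out differently. The question asked for degrees Celsius, so
the sketch converts the Fahrenheit reading at the end.

```r
library(XML)

# Hypothetical fragment imitating the markup the query relies on;
# the real page may differ or may have changed.
page <- '<html><body>
  <span pwsid="LIRA" pwsvariable="tempf" value="63">63</span>
  <span pwsid="LIRA" pwsvariable="humidity" value="71">71</span>
</body></html>'

# Parse the string (not a URL) into an internal document
doc <- htmlParse(page, asText = TRUE)

# Small helper (hypothetical name): fetch the 'value' attribute
# for one station and one variable, as a number
pws_value <- function(doc, id, variable) {
  query <- sprintf("//span[@pwsvariable='%s' and @pwsid='%s']/@value",
                   variable, id)
  as.numeric(xpathSApply(doc, query))
}

tempf    <- pws_value(doc, "LIRA", "tempf")
humidity <- pws_value(doc, "LIRA", "humidity")  # assumed variable name

# Convert Fahrenheit to Celsius
tempc <- (tempf - 32) * 5 / 9
```

For many stations, the same helper could be called in a loop over
station ids, with one parsed document per page.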

Martin

Daniel Folkinshteyn <dfolkins at gmail.com> writes:

> i know this is an R mailing list :) but... i'll recommend you try
> python with the beautifulsoup module - makes html processing a cinch.
>
> another thing to note is that wunderground provides very handy RSS
> feeds for every location, so rather than parsing the html page (with
> its associated bundles of gunk), you'd have a better time parsing the
> RSS feed. (there are some rss parsing libraries for python, too, but
> in your simple case it may be simpler to just extract stuff manually
> with some well-placed regexps)
>
> so use python to pull that out, and append to a nice tab-delimited
> file, and then in your R process just read from that file.
>
> on 06/05/2008 04:45 PM Nutter, Benjamin said the following:
>> I've tried to tackle a similar question at the request of a coworker.
>> Unfortunately, it is difficult to read in HTML code because it lacks a
>> character that can consistently be used as a delimiter.  The only
>> guideline I can offer is that any text you're interested in is going to
>> be between a ">" and a "<".  So the goal is to eliminate anything
>> between < and >.
>> What's more, if you really want to read in HTML code, you'll need a good
>> grasp of HTML itself, and some familiarity with how the code you're
>> reading in is structured.  For instance, I'm attaching code that I wrote
>> to read in HTML tables that were generated by other functions commonly
>> used in my workplace.  But my code assumes that the tables are written
>> by row (using the <tr> tag).
>> Essentially, after studying the code I was going to read in, I
>> hand-picked the markers that I could use to isolate the text I wanted.  I
>> then proceeded to play a game of Simon Says to break down the code to
>> smaller and smaller pieces until I got what I wanted.  Unless you're
>> going to be doing this a lot, I wouldn't recommend taking
>> the time to try and write a function like this.  In most cases it's
>> probably faster just to copy the data by hand.  But if you are
>> determined to make it work, I hope the ideas help.
>> Benjamin
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>> On Behalf Of vittorio
>> Sent: Wednesday, June 04, 2008 3:50 PM
>> To: r-help at stat.math.ethz.ch
>> Subject: [Possible SPAM] [R] Reading selected lines in an .html file
>> Dear friend,
>> In an R program running permanently on a server I would like to read,
>> hour by hour, the temperature in °C and the humidity from a site like
>> this (actually, from many such sites):
>> http://www.wunderground.com/global/stations/16239.html
>> How can I read the content of the site and select the info I need?
>> Ciao
>> Vittorio
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

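The tag-stripping idea in the quoted message (drop everything between
< and >) can be sketched in base R as below. The input string is a
made-up table row, not taken from the weather page. This is only a
rough illustration: a regular expression like this breaks on a ">"
inside an attribute value, on comments, and on script blocks, which is
one reason a real parser is usually the safer choice.

```r
# Hypothetical one-line table row; real pages are messier.
html <- "<tr><td>Temperature</td><td>63</td></tr>"

# Replace every <...> run with a tab so adjacent cells stay separated
text <- gsub("<[^>]*>", "\t", html)

# Split on runs of tabs and drop the empty pieces left at the edges
fields <- strsplit(text, "\t+")[[1]]
fields <- fields[nzchar(fields)]
```

Here fields holds the cell contents as character strings, so numeric
cells still need as.numeric() before use.
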
-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
