[R] Extracting a data.frame from HTML code

Martin Morgan mtmorgan at fhcrc.org
Sun Apr 13 01:37:48 CEST 2008


Hi Ethan --

Use the XML package:

> library(XML)
> url <- 'http://www.nascar.com/races/cup/2007/1/data/standings_official.html'
> xml <- htmlTreeParse(url, useInternalNodes=TRUE)

The previous line retrieves the HTML and stores it in an internal
representation. There are warnings, but I think these are about
ill-formed HTML at nascar.com.

A little looking suggests that the data you're after are table data
(element 'td') inside table rows ('tr') inside a 'tbody' element. A
little more looking shows that there's a blank row in the table, at
unlucky row 13, I guess.
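
One way to see this for yourself (a sketch; it just counts the 'td'
cells in each row, so the blank row shows up as a zero):

> rows <- getNodeSet(xml, "//tbody/tr")
> sapply(rows, function(r) sum(names(xmlChildren(r)) == "td"))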

So what we'd like to do is extract all 'td' elements from all the
rows except unlucky 13. We do this with an 'xpath' query, which
specifies the path, from the root of the document through the
relevant nodes, to the data that we want. Here's the query and the
data extraction:

> q <- "//tbody/tr[position()!=13]/td"
> dat <- unlist(xpathApply(xml, q, xmlValue))

The '//tbody' says 'find any tbody node somewhere below the current
node' (i.e., the root, at this point in the query), '/' says
'immediately below the current node', and the predicate
'[position()!=13]' gives us basic logical tests for subsetting the
nodes we're after. xmlValue extracts the 'value' (text content,
roughly) of each node the path describes. This is a nice weekend
hack, relying on the overall structure of the table and assuming, for
instance, that there is only one tbody on the page. We'd have to work
harder during the week.
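
If you'd rather not hard-code row 13, a slightly more robust query (a
sketch, assuming the blank row has no non-empty cells) keeps only the
rows that contain at least one non-empty 'td':

> q2 <- "//tbody/tr[td[normalize-space()]]/td"
> dat <- unlist(xpathApply(xml, q2, xmlValue))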

And then some R to make it into a data frame:

> df <- as.data.frame(t(matrix(dat, 11)))

(11 because we've counted how many columns there are in the table; we
could have discovered this from the document, e.g.,
"count(//tbody/tr[1]/td)" as the xpath). The columns are all
character, whereas you'd like some to be numeric.
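
For instance (a sketch; getNodeSet returns the numeric result of a
count() expression, and type.convert is the base R helper for
guessing column types):

> ncols <- getNodeSet(xml, "count(//tbody/tr[1]/td)")   # 11 here
> df <- as.data.frame(t(matrix(dat, ncols)))
> ## columns come in as factors; go via character, then to numeric where possible
> df[] <- lapply(df, function(x) type.convert(as.character(x), as.is=TRUE))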

The page at http://www.w3.org/TR/xpath is very helpful for xpath,
especially section 2.5.

Hope that helps,

Martin

"Ethan Pew" <ethanpew+rlist at gmail.com> writes:

> Dear all,
>
> I'd like to use R to read in data from the web. I need some help finding an
> efficient way to strip the HTML tags and reformat the data as a data.frame
> to analyze in R.
>
> I'm currently using readLines() to read in the HTML code and then grep() to
> isolate the block of HTML code I want from each page, but this may not be
> the best approach.
>
> A short example:
> x1 <- readLines("
> http://www.nascar.com/races/cup/2007/1/data/standings_official.html",n=-1)
>
> grep1 <- grep("<table",x1,value=FALSE)
> grep2 <- grep("</table>",x1,value=FALSE)
>
> block1 <- x1[grep1:grep2]
>
>
> It seems like there should be a straightforward solution to extract a
> data.frame from the HTML code (especially since the data is already
> formatted as a table) but I haven't had any luck in my searches so far.
> Ultimately I'd like to compile several datasets from multiple webpages and
> websites, and I'm optimistic that I can use R to automate the process.  If
> someone could point me in the right direction, that would be fantastic.
>
> Many thanks in advance,
> Ethan
>
>
>
> Ethan Pew
> Doctoral Candidate, Marketing
> Leeds School of Business
> University of Colorado at Boulder
>

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


