[R] Need help reading website info with XML package and XPath

Martin Morgan mtmorgan at fhcrc.org
Tue May 31 18:12:26 CEST 2011


On 05/30/2011 09:04 AM, eric wrote:
> Hi, I'm looking for help extracting some information from the zillow website.
> I'd like to do this for the general case where I manually change the address
> by modifying the url (see code below). With the url containing the address,
> I'd like to be able to extract the same information each time. The specific
> information I'd like to be able to extract includes the homedetails url,
> price (zestimate), number of beds, number of baths, and the Sqft. All this
> information is shown in a bubble on the webpage.
>
> I use the code below to try to do this, but it's not working. I know the
> information I'm interested in is there because if I print out "doc", I see it
> all in one area. I've attached the relevant section of "doc" that shows and
> highlights all the information I'm interested in (note that either url
> that's highlighted in doc is fine).
> http://r.789695.n4.nabble.com/file/n3561075/relevant-section-of-doc.pdf
> relevant-section-of-doc.pdf

Hi Eric -- the problem is that the highlighted text is not in the XML 
per se, but embedded in a comment. You can extract the text of the 
comment as

getNodeSet(doc, 'string(//div[@id="resurrection-page-state"]/comment())')

You could go on to feed some of that text into another XML document and 
use XPath on that, but... you're really 'screen scraping' here, which 
doesn't really showcase what XML is about. If you're trying to learn to 
use XML, then I'd suggest choosing a simpler example. If you're trying 
to corner the housing market (or whatever one does to housing markets) 
then you'll want to find a better data source.
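To make the idea concrete, here is a minimal, self-contained sketch of that 
two-step approach. Note the HTML string, div id, and LIST element below are 
invented for illustration; Zillow's actual markup will differ:

```r
## Extract text hidden inside an HTML comment, then re-parse it as XML.
## Assumes the XML package is installed.
library(XML)

## Toy document standing in for the real page: the data of interest
## lives inside a comment node, not in the element tree itself.
html <- '<html><body>
  <div id="page-state"><!-- <LIST Beds="3" Baths="2" Sqft="1200"/> --></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

## string(...) collapses the comment node to its plain-text content
txt <- getNodeSet(doc, 'string(//div[@id="page-state"]/comment())')

## The comment body is itself XML, so parse it and query it with XPath
inner <- xmlParse(txt, asText = TRUE)
beds  <- getNodeSet(inner, "string(//LIST/@Beds)")
sqft  <- getNodeSet(inner, "string(//LIST/@Sqft)")
```

With a string()-valued XPath expression, getNodeSet returns the character 
result directly rather than a node set, which is why no sapply/xmlValue 
step is needed here.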

Hope that helps,

Martin

>
> I'm guessing my xpath statements are wrong or getNodeSet needs something
> else to get to information contained in a bubble on a webpage. Any
> suggestions or ideas would be GREATLY appreciated.
>
>
> library(XML)
> url<- "http://www.zillow.com/homes/511 W Lafayette St, Norristown, PA_rb"
> doc<- htmlTreeParse(url, useInternalNodes=TRUE, isURL=TRUE)
> f1<- getNodeSet(doc, "//a[contains(@href,'homedetails')]")
> f2<- getNodeSet(doc, "//span[contains(@class,'price')]")
> f3<- getNodeSet(doc, "//LIST[@Beds]")
> f4<- getNodeSet(doc, "//LIST[@Baths]")
> f5<- getNodeSet(doc, "//LIST[@Sqft]")
> g1<-sapply(f1, xmlValue)
> g2<-sapply(f2, xmlValue)
> g3<-sapply(f3, xmlValue)
> g4<-sapply(f4, xmlValue)
> g5<-sapply(f5, xmlValue)
> print(f1)
>
>
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


