[R] Need help reading website info with XML package and XPath

eric ericstrom at aol.com
Mon May 30 18:04:08 CEST 2011


Hi, I'm looking for help extracting some information from the Zillow website.
I'd like to do this for the general case, where I change the address manually
by modifying the URL (see code below). For any URL containing an address,
I'd like to be able to extract the same fields each time: the homedetails URL,
the price (Zestimate), the number of beds, the number of baths, and the Sqft.
All of this information is shown in a bubble on the webpage.

I use the code below to try to do this, but it's not working. I know the
information I'm interested in is there because when I print out "doc", I see it
all in one area. I've attached the relevant section of "doc", which shows and
highlights all the information I'm interested in (note that either of the URLs
highlighted in doc is fine):
http://r.789695.n4.nabble.com/file/n3561075/relevant-section-of-doc.pdf

I'm guessing that my XPath statements are wrong, or that getNodeSet needs
something else to get to information contained in a bubble on a webpage. Any
suggestions or ideas would be GREATLY appreciated.


library(XML)

# URLencode() escapes the spaces in the address so the URL is valid
url <- URLencode("http://www.zillow.com/homes/511 W Lafayette St, Norristown, PA_rb")
doc <- htmlTreeParse(url, useInternalNodes = TRUE, isURL = TRUE)

# XPath queries, one per field of interest
f1 <- getNodeSet(doc, "//a[contains(@href,'homedetails')]")  # homedetails url
f2 <- getNodeSet(doc, "//span[contains(@class,'price')]")    # price (zestimate)
f3 <- getNodeSet(doc, "//LIST[@Beds]")                       # beds
f4 <- getNodeSet(doc, "//LIST[@Baths]")                      # baths
f5 <- getNodeSet(doc, "//LIST[@Sqft]")                       # Sqft

# Text content of every matched node
g1 <- sapply(f1, xmlValue)
g2 <- sapply(f2, xmlValue)
g3 <- sapply(f3, xmlValue)
g4 <- sapply(f4, xmlValue)
g5 <- sapply(f5, xmlValue)
print(f1)
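
For comparison, here is a minimal sketch of how these kinds of XPath predicates behave on a small inline document, so the queries can be checked without fetching the page. The HTML snippet, all of its values, and the helper build_search_url() are made up for illustration; the real page markup is not reproduced here and may well differ, so the expressions would need adapting to whatever "doc" actually contains.

```r
library(XML)

# Hypothetical helper (not part of the XML package): build the search URL
# for an address. URLencode() from utils escapes the spaces as %20.
build_search_url <- function(address) {
  URLencode(paste0("http://www.zillow.com/homes/", address, "_rb"))
}
url <- build_search_url("511 W Lafayette St, Norristown, PA")

# Invented stand-in for the parsed page; structure and values are assumptions.
html <- '<html><body>
  <a href="/homedetails/example-address/123_zpid/">Details</a>
  <span class="price">$100,000</span>
  <ul>
    <li>Beds: 3</li>
    <li>Baths: 1.5</li>
    <li>Sqft: 1,200</li>
  </ul>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# [@Beds] tests for an ATTRIBUTE named Beds on an element named LIST.
# Nothing in this snippet has that shape, so the node set is empty.
no_match <- getNodeSet(doc, "//LIST[@Beds]")

# contains(@attr, '...') matches on an attribute's value;
# contains(text(), '...') matches on the element's text instead.
href  <- xpathSApply(doc, "//a[contains(@href, 'homedetails')]", xmlGetAttr, "href")
price <- xpathSApply(doc, "//span[contains(@class, 'price')]", xmlValue)
beds  <- xpathSApply(doc, "//li[contains(text(), 'Beds')]", xmlValue)
baths <- xpathSApply(doc, "//li[contains(text(), 'Baths')]", xmlValue)
sqft  <- xpathSApply(doc, "//li[contains(text(), 'Sqft')]", xmlValue)
```

Note that xmlValue returns the text inside a node, so for the link the href attribute is read with xmlGetAttr instead. Checking length() of each node set first is a quick way to tell an XPath expression that matches nothing from one that matches the wrong node.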



--
View this message in context: http://r.789695.n4.nabble.com/Need-help-reading-website-info-with-XML-package-and-XPath-tp3561075p3561075.html
Sent from the R help mailing list archive at Nabble.com.