[R] Scraping a web page

Sharpie chuck at sharpsteen.net
Fri Dec 4 00:12:57 CET 2009



Michael Conklin wrote:
> 
> I would like to be able to submit a list of URLs of various webpages and
> extract the "content" i.e. not the mark-up of those pages. I can find
> plenty of examples in the XML library of extracting links from pages but I
> cannot seem to find a way to extract the text.  Any help would be greatly
> appreciated - I will not know the structure of the URLs I would submit in
> advance.  Any suggestions on where to look would be greatly appreciated.
> 
> Mike
> 
> W. Michael Conklin
> Chief Methodologist
> 

What kind of "content" are you after? Tables? Chunks of text?  For tables
you can use the readHTMLTable() function in the XML package.  There was also
some discussion of alternative ways to extract data from tables in this
thread:

 
http://n4.nabble.com/Downloading-data-from-from-internet-td889838.html#a889845
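
For the table case, a minimal sketch might look like the following (the URL
is made up for illustration; readHTMLTable() returns a list of data frames,
one per table found on the page):

  library(XML)

  ## Parse every <table> on the page into a data frame
  ## (hypothetical URL, substitute one of your own)
  tables <- readHTMLTable("http://www.example.com/stats.html")

  ## Pick out the first table on the page
  df <- tables[[1]]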

If you're after text, then it's probably a matter of locating the element
that encloses the data you want, perhaps by using getNodeSet() along with an
XPath expression[1] that specifies the element you are interested in.  The
text can then be recovered using the xmlValue() function.
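
A rough sketch of that approach (both the URL and the XPath here are
invented; you would tailor the expression to the actual structure of the
pages you are scraping):

  library(XML)

  ## Parse the page into an HTML document tree
  ## (hypothetical URL)
  doc <- htmlParse("http://www.example.com/article.html")

  ## Select the enclosing elements; this XPath assumes the text
  ## sits in <p> tags inside a <div class="content">
  nodes <- getNodeSet(doc, "//div[@class='content']//p")

  ## Strip the markup, keeping only the text of each node
  text <- sapply(nodes, xmlValue)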

Hope this helps!

-Charlie

  [1]:  http://www.w3schools.com/XPath/xpath_syntax.asp




