[R] Scrap JavaScript and styles from an HTML document

antujsrv antujsrv at gmail.com
Tue Mar 29 08:38:58 CEST 2011


Hi,

I am developing a web crawler in R and need help removing JavaScript and
style sheets from the HTML document of a web page.

I tried the XML package and its xpathApply function:

library(XML)
# html is assumed to be a document already parsed, e.g. with htmlParse()
txt = xpathApply(html,
                 "//body//text()[not(ancestor::script)][not(ancestor::style)]",
                 xmlValue)

The output is a list of plain text lines, without any HTML tags. I want the
HTML tags to remain intact and only the JavaScript and styles to be removed.

Any help would be highly appreciated.
Thanks in advance.
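
For illustration, here is a minimal sketch of the result described above,
using the same XML package. Instead of selecting the text nodes, it selects
the script and style elements themselves and removes them from the parsed
tree, then writes the tree back out as HTML. The sample page and the names
page, doc, unwanted and cleaned are made up for this example, and this is
only one possible approach, not a definitive solution.

```r
library(XML)

# Hypothetical sample page standing in for a crawled document
page <- paste0(
  "<html><head><style>p {color: red;}</style>",
  "<script>var x = 1;</script></head>",
  "<body><p>Hello</p><script>alert('hi');</script></body></html>"
)

# Parse the HTML text into an internal document tree
doc <- htmlParse(page, asText = TRUE)

# Collect every <script> and <style> element anywhere in the tree
unwanted <- getNodeSet(doc, "//script | //style")

# Remove those elements (and their contents) from the tree in place
removeNodes(unwanted)

# Serialize the remaining tree back to HTML text, tags intact
cleaned <- saveXML(doc)
cat(cleaned)
```

Note that this sketch only drops script and style elements. Inline style
attributes and externally linked style sheets (link elements) would need
additional XPath expressions.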


--
View this message in context: http://r.789695.n4.nabble.com/Scrap-java-scripts-and-styles-from-an-html-document-tp3413894p3413894.html
Sent from the R help mailing list archive at Nabble.com.


