[R] Extracting text from html code using the RCurl package.

Tue Oct 21 17:42:40 CEST 2008

Thank you both, Martin and Gabor, for your responses; very much
appreciated!

In case anyone searches for this topic, I thought I'd write a few
comments below on what I ended up doing:

re: Internet Explorer (IE) - Finding out that R can access IE was a
very pleasant surprise! This works very well at extracting the plain
text from an html formatted page. The only downsides for me were (1) it
is rather slow if you wish to convert lots of html files into plain
text files, even if the html files are already on your computer, and
(2) when converting some html files, an IE 'pop-up' window may show up
and execution cannot continue until that pop-up has been dealt with.
There may be ways around this, but I am not aware of them.

## This is an example of the code I used:
library(RDCOMClient)
urls <- c("https://stat.ethz.ch/mailman/listinfo/r-help",
          "http://wiki.r-project.org/rwiki/doku.php?id=getting-started:what-is-r:what-is-r")
ie <- COMCreate("InternetExplorer.Application")
txt <- list()
for(u in urls) {
  ie$Navigate(u)
  while(ie[["Busy"]]) Sys.sleep(1)
  txt[[u]] <- ie[["document"]][["body"]][["innerText"]]
}
ie$Quit()
print(txt)

re: xpathApply() - I must admit that this was a little confusing when
I first encountered it after reading your post, but after some reading
I think I have found out how to get what I want. This seems to work
almost as well as IE above, and it is faster for my purposes, probably
because there is no need to wait for an external application. There is
also no danger of a 'pop-up' window showing. As far as I can tell, all
plain text is extracted.

library(RCurl)
library(XML)
urls <- c("https://stat.ethz.ch/mailman/listinfo/r-help",
          "http://wiki.r-project.org/rwiki/doku.php?id=getting-started:what-is-r:what-is-r")
txt <- list()
## download all pages; the results are indexed by url in the loop below
html.files <- getURL(urls, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
                     followlocation = TRUE)
for(u in urls) {
  ## asText = TRUE because the input is html source, not a file name
  html <- htmlTreeParse(html.files[[u]], asText = TRUE,
                        useInternalNodes = TRUE)
  ## keep every text node under <body> except those inside <script> or <style>
  txt[[u]] <- toString(xpathApply(html,
    "//body//text()[not(ancestor::script)][not(ancestor::style)]", xmlValue))
}
print(txt)
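
For anyone who wants to try the XPath filter without going online, here
is a minimal sketch that runs on an html string instead of a downloaded
page. The helper name html_to_text() and the sample page are made up for
illustration; they are not part of the XML or RCurl packages. It assumes
a version of the XML package that provides htmlParse() and xpathSApply().

```r
library(XML)

## html_to_text() is a hypothetical helper name, used only for this sketch.
## It parses html source held in a string and returns the visible body text.
html_to_text <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  ## same XPath as above: text nodes under <body>, skipping script/style
  nodes <- xpathSApply(doc,
    "//body//text()[not(ancestor::script)][not(ancestor::style)]",
    xmlValue)
  ## unlist() guards against the empty-list result when nothing matches
  paste(unlist(nodes), collapse = " ")
}

## made-up page: one <script> inside the body that should be dropped
page <- paste0(
  "<html><head><title>Demo</title></head>",
  "<body><h1>Title</h1><script>var x = 1;</script>",
  "<p>Hello <b>world</b></p></body></html>")

txt <- html_to_text(page)
print(txt)
```

The same helper could be applied to each element of html.files, or to
the contents of html files already on disk, which is where I found this
approach quicker than driving IE.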

Cheers,
Tony Breyal

On 6 Oct, 16:45, Tony Breyal <tony.bre... at googlemail.com> wrote:
> Dear R-help,
>
> I want to download the text from a web page, however what I end up
> with is the html code. Is there some option that I am missing in the
> RCurl package? Or is there another way to achieve this? This is the
> code I am using:
>
> > library(RCurl)
> > my.url <- 'https://stat.ethz.ch/mailman/listinfo/r-help'
> > html.file <- getURI(my.url, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE, followlocation = TRUE)
> > print(html.file)
>
> I thought perhaps the htmlTreeParse() function from the XML package
> might help, but I just don't know what to do next with it:
>
> > library(XML)
> > htmlTreeParse(html.file)
>
> Many thanks for any help you can provide,
> Tony Breyal
>
> > sessionInfo()
>
> R version 2.7.2 (2008-08-25)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] XML_1.94-0  RCurl_0.9-4