[R] Extracting text from html code using the RCurl package.
ggrothendieck at gmail.com
Tue Oct 7 18:52:04 CEST 2008
I gather you are using Windows, in which case you could
use RDCOMClient or rcom to get the text via Internet Explorer, e.g.
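library(RDCOMClient)
ie <- COMCreate("InternetExplorer.Application")
URL <- "https://stat.ethz.ch/mailman/listinfo/r-help"
ie$Navigate(URL)  # load the page; it must finish loading before the next line
txt <- ie[["document"]][["body"]][["innerText"]]  # rendered text, no HTML tags
ie$Quit()  # close the Internet Explorer instance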
You may need to run this in elevated mode if you are on Vista.
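If you would rather stay with the XML package mentioned in the question, the following is a minimal sketch of one possible approach, not a tested recipe for that particular page. It assumes the page source is already available as a character string; the inline html.file value below is a hypothetical stand-in for what getURI() returns, and the XPath expression (skipping script and style content) is my own choice.

```r
library(XML)

# Hypothetical stand-in for the page source returned by getURI()
html.file <- paste0(
  "<html><head><title>Demo</title><script>var x = 1;</script></head>",
  "<body><h1>R-help</h1><p>Main R mailing list.</p></body></html>"
)

# Parse the string into an internal document so that XPath queries work
doc <- htmlTreeParse(html.file, asText = TRUE, useInternalNodes = TRUE)

# Collect the text nodes under <body>, skipping script and style content
pieces <- xpathSApply(
  doc,
  "//body//text()[not(ancestor::script)][not(ancestor::style)]",
  xmlValue
)

# Drop whitespace-only pieces and join the rest, one piece per line
pieces <- trimws(pieces)
txt <- paste(pieces[nzchar(pieces)], collapse = "\n")

# Release the memory held by the internal document
free(doc)
```

Unlike innerText from a browser, this only sees the HTML as delivered, so text produced by JavaScript after loading would not appear.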
On Mon, Oct 6, 2008 at 11:45 AM, Tony Breyal <tony.breyal at googlemail.com> wrote:
> Dear R-help,
> I want to download the text from a web page, but what I end up
> with is the HTML code. Is there some option that I am missing in the
> RCurl package? Or is there another way to achieve this? This is the
> code I am using:
>> my.url <- 'https://stat.ethz.ch/mailman/listinfo/r-help'
>> html.file <- getURI(my.url, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE, followlocation = TRUE)
> I thought perhaps the htmlTreeParse() function from the XML package
> might help, but I just don't know what to do next with it:
> Many thanks for any help you can provide,
> Tony Breyal
> R version 2.7.2 (2008-08-25)
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;
> LC_MONETARY=English_United Kingdom.1252;
> LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
> attached base packages:
>  stats graphics grDevices utils datasets methods
> other attached packages:
>  XML_1.94-0 RCurl_0.9-4