[R] Extracting a website text content using R

mtmorgan at fhcrc.org mtmorgan at fhcrc.org
Thu Aug 2 04:08:07 CEST 2007


Perhaps more fun is

> library(XML)
> res = htmlTreeParse("http://www.omegahat.org/RSXML/", useInternalNodes=TRUE)
> xpathApply(res, "//h1", xmlValue)
[[1]]
[1] "An XML package for the S language"
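
The same XPath approach extends from headings to the rest of a page's text. A minimal sketch, assuming the XML package is installed; the inline HTML string and the variable names are made up for illustration so that the example runs without network access:

```r
library(XML)

# Hypothetical page content, inlined so the example runs offline.
html <- "<html><head><title>Demo</title></head>
<body><h1>An example heading</h1>
<p>First paragraph.</p><p>Second paragraph.</p></body></html>"

# asText = TRUE tells the parser that `html` is markup, not a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# xpathSApply simplifies the result to a character vector where possible.
headings   <- xpathSApply(doc, "//h1", xmlValue)
paragraphs <- xpathSApply(doc, "//body//p", xmlValue)

# Text of the whole body with the markup removed.
body_text <- xmlValue(getNodeSet(doc, "//body")[[1]])
```

For a live page, the string would be replaced by the page's address and asText dropped; depending on the installed version, HTTPS pages may need to be downloaded first and then parsed as text.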

Martin

Quoting Steven McKinney <smckinney at bccrc.ca>:

> 
> 
> >-----Original Message-----
> >From: r-help-bounces at stat.math.ethz.ch on behalf of Am Stat
> >Sent: Wed 8/1/2007 2:19 PM
> >To: r-help at stat.math.ethz.ch
> >Subject: [R] Extracting a website text content using R
>  
> >Dear useR,
> 
> >Just wondering whether there is any function in R that could
> >let me get the text content of a given website.
> 
> >Thanks a lot!
> 
> >Best,
> 
> >Leon
> 
> 
> Is this what you had in mind?
> 
> > foo <- scan(url("http://cran.r-project.org/"), what = "character")
> Read 69 items
> > paste(unlist(foo), collapse = " ")
> [1] "<!DOCTYPE HTML PUBLIC -//IETF//DTD HTML//EN > <html> <head> <title>The
> Comprehensive R Archive Network</title> <link rel=\"icon\"
> href=\"favicon.ico\" type=\"image/x-icon\"> <link rel=\"shortcut icon\"
> href=\"favicon.ico\" type=\"image/x-icon\"> <link rel=\"stylesheet\"
> type=\"text/css\" href=\"R.css\"> </head> <FRAMESET cols=\"1*, 4*\" border=0>
> <FRAMESET rows=\"120, 1*\"> <FRAME src=\"logo.html\" name=\"logo\"
> frameborder=0> <FRAME src=\"navbar.html\" name=\"contents\" frameborder=0>
> </FRAMESET> <FRAME src=\"banner.shtml\" name=\"banner\" frameborder=0>
> <noframes> <h1>The Comprehensive R Archive Network</h1> Your browser seems
> not to support frames, here is the <A href=\"navbar.html\">contents page</A>
> of CRAN. </noframes> </FRAMESET>"
> 
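
scan() splits its input on whitespace, which is why the page above came back as 69 separate items. A sketch of an alternative, assuming readLines() to keep line boundaries and a crude regular expression to drop the tags; the inline HTML is hypothetical, and the regex is a rough filter rather than a real HTML parser:

```r
# Hypothetical page content; a textConnection stands in for url("...").
con <- textConnection(c(
  "<html><head><title>Demo page</title></head>",
  "<body><h1>Hello</h1>",
  "<p>Some text.</p></body></html>"
))
page_lines <- readLines(con)   # one element per line of the page
close(con)

html <- paste(page_lines, collapse = "\n")

# Crude tag stripping: replace anything between < and > with a space,
# then collapse runs of whitespace and trim the ends.
text <- gsub("<[^>]+>", " ", html)
text <- trimws(gsub("[[:space:]]+", " ", text))
```

This is fine for a quick look at a simple page; markup with scripts, comments, or entities is better handled by a parser such as the one in the XML package.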
> 
> Try the search phrase
> 
> cran scan url
> 
> in Google for more information about R functions that
> can deal with URLs.
> 
> In R try
> 
> > apropos("URL")
>  [1] "contourLines"   "URLdecode"      "URLencode"      "browseURL"      "contrib.url"    "main.help.url"  "url.show"
>  [8] "loadURL"        "read.table.url" "scan.url"       "source.url"     "url"
> 
> 
> SteveM
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


