[R] parsing Google search results

Tony B tony.breyal at googlemail.com
Tue Nov 17 17:54:01 CET 2009


Hi Philip,

If I understood correctly, you just want to get the URLs from a given
Google search? I have some old code you could adapt which extracts the
main links from a Google search. It makes use of XPath expressions via
the lovely XML and RCurl packages:

> library(XML)
> library(RCurl)
>
> getGoogleURL <- function(search.term, domain = '.co.uk', quotes = TRUE) {
+   ## percent-encode spaces and (optionally) wrap the term in quotes
+   search.term <- gsub(' ', '%20', search.term)
+   if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
+   paste('http://www.google', domain, '/search?q=', search.term, sep = '')
+ }
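For example, here is a hand-worked sketch of the URL such a helper builds (the search term 'r project' is just an illustration, not from the actual session below):

```r
## Sketch of the URL construction: percent-encode spaces, wrap the
## term in %22 (") quotes, then paste the pieces together, mirroring
## the helper above.
search.term <- 'r project'
encoded <- gsub(' ', '%20', search.term)
quoted  <- paste('%22', encoded, '%22', sep = '')
url     <- paste('http://www.google', '.co.uk', '/search?q=', quoted, sep = '')
url
## "http://www.google.co.uk/search?q=%22r%20project%22"
```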
>
> getGoogleLinks <- function(google.url) {
+   ## send a User-Agent header, otherwise Google may refuse the request
+   doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
+   html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...) {})
+   ## the next line is very important to parse the html ##
+   nodes <- getNodeSet(html, "//a[@href][@class='l']")
+   sapply(nodes, function(x) xmlAttrs(x)[["href"]])
+ }
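If you want to try the XPath step without hitting Google, the same idea works on a hard-coded HTML string (the snippet and links below are made up purely for illustration; only the XML package is needed):

```r
library(XML)

## A toy page with two "main result" links (class="l") and one other link.
snippet <- '<html><body>
  <a class="l" href="http://cran.r-project.org/">CRAN</a>
  <a class="x" href="http://example.com/">other</a>
  <a class="l" href="http://www.r-project.org/">R</a>
</body></html>'

doc   <- htmlTreeParse(snippet, useInternalNodes = TRUE)
## select only anchors that have an href AND class="l"
nodes <- getNodeSet(doc, "//a[@href][@class='l']")
links <- sapply(nodes, function(x) xmlAttrs(x)[["href"]])
links
## the two class="l" hrefs, in document order
```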
>
>
> search.term <- "cran"
> quotes <- FALSE
>
> search.url <- getGoogleURL(search.term = search.term, quotes = quotes)
>
> links <- getGoogleLinks(search.url)
> links
 [1] "http://cran.r-project.org/"
 [2] "http://cran.r-project.org/web/packages/"
 [3] "http://www.cranmusic.com/"
 [4] "http://www.sizes.com/units/cran.htm"
 [5] "http://www.r-project.org/"
 [6] "http://www.myspace.com/cranmusic"
 [7] "http://www.rozcran.co.uk/"
 [8] "http://www.cherylcran.com/"
 [9] "http://www.chriscran.com/"
[10] "http://www.cranhillranch.com/"
[11] "http://www.yumsugar.com/6262265"
[12] "http://www.yumsugar.com/6262259"

Hope that helps a little,
Tony Breyal

On 16 Nov, 19:29, Philip Leifeld <Leif... at coll.mpg.de> wrote:
> Hi,
>
> how can I parse Google search results? The following code returns
> "integer(0)" instead of "1" although the results of the query clearly
> contain the regex "cran".
>
> ####
> address <- url("http://www.google.com/search?q=cran")
> open(address)
> lines <- readLines(address)
> grep("cran", lines[3])
> ####
>
> Thanks
>
> Philip
>
> --
> Philip Leifeld
> Max Planck Institute for     | +49 (0) 1577 6830349 (mobile)
> Research on Collective Goods | +49 (0) 228 91416-73 (phone)
> MaxNetAging Doctoral Fellow  | +49 (0) 228 91416-62 (fax)
> Kurt-Schumacher-Str. 10      |
> 53113 Bonn, Germany          | http://www.philipleifeld.de
>
> ______________________________________________
> R-h... at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
