[R] webscraping a multi-level website

Ilio Fornasero iliofornasero at hotmail.com
Thu Apr 18 10:35:36 CEST 2019


Hello.
I am trying to scrape a website that includes some links from which I have to pick information. I have been working on this for a few days now.

## So far, I am getting the page and the URLs I am interested in:
library(rvest)   # read_html(), html_nodes(), html_attr(); also provides %>%

url <- "http://www.fao.org/countryprofiles/en/"
webscrape <- read_html(url)

urls <- webscrape %>%
  html_nodes(".linkcountry") %>%
  html_attr("href") %>%
  as.character()
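
As a quick sanity check at this point (the exact values depend on the live page, so this is only how I look at what came back):

head(urls)     # first few hrefs extracted from the country list
length(urls)   # how many country links were found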


## This gives me the complete links:
urls <- paste0("http://www.fao.org", urls)



## Nevertheless, I prefer this option:
urls <- paste0("http://www.fao.org", urls_country <- data.frame(country=character(), country_url=character()))


## Then I loop over the country pages to collect the news items

for (i in urls) {
  webscrape1 <- read_html(i)

  # text of the news items on the country page
  country <- webscrape1 %>%
    html_nodes(".#newsItems") %>%
    html_text() %>%
    as.character()

  # links of the news items on the country page
  country_url <- webscrape1 %>%
    html_nodes(".#newsItems") %>%
    html_attr("href") %>%
    as.character()

  temp_fao <- data.frame(country, country_url)

  urls_country <- rbind(urls_country, temp_fao)

  cat("*")  # progress marker
}
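
For what it is worth, I know the same collection step could also be written without growing the data frame inside the loop. This is only a sketch of the idea, reusing the same selector, and I have not tested it against the live site:

## one data frame of news text + link per country page, then bind them all
news_list <- lapply(urls, function(i) {
  page <- read_html(i)
  data.frame(
    country     = page %>% html_nodes(".#newsItems") %>% html_text(),
    country_url = page %>% html_nodes(".#newsItems") %>% html_attr("href"),
    stringsAsFactors = FALSE
  )
})
urls_country <- do.call(rbind, news_list)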



In any case, when I run the loop I get the following message:

Error in open.connection(x, "rb") :
  Could not resolve host: www.fao.orginteger(0)

Any hint?
Thanks in advance



