[R] Web scraping different levels of a website

Thu Jan 18 12:58:05 CET 2018

Hey Ilio,

On the main website (the first link that you provided), if you
right-click on the title of any entry and select Inspect Element from
the menu, the Developer Tools view that opens up shows that the
corresponding HTML looks like this (example for the same entry that
you linked to):

<div class="survey-row"
     data-url="http://catalog.ihsn.org/index.php/catalog/7118"
     title="View study">
    <div class="data-access-icon data-access-remote"
         title="Data available from external repository"></div>
    <h2 class="title">
        <a href="http://catalog.ihsn.org/index.php/catalog/7118"
           title="Demographic and Health Survey 2015">
            Demographic and Health Survey 2015
        </a>
    </h2>

Notice how the number you are after is contained within the
"survey-row" div element, in the data-url attribute. Alternatively,
it also appears within the <a> element, in the href attribute. It's up
to you which one you grab, but the idea is the same, i.e.

1. read in the html
2. select all list-elements by css / xpath
3. grab the forward link from the attribute

Here is an example using the first option.

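library(httr)   # GET(), content()
library(rvest)  # html_nodes(), html_attr(), and the %>% pipe

url <- "http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk="

# 1. read in the html
x <-
  url %>%
  GET() %>%
  content()

# 2. select all entries by css, 3. grab the link from data-url
x %>%
  html_nodes(".survey-row") %>%
  html_attr("data-url")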
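
For the second option (the href of the <a> element), here is a minimal
sketch along the same lines. It assumes rvest is installed. The markup
is a trimmed copy of the snippet above, pasted inline with the closing
tags added so that it runs without hitting the site. The names `html`,
`page`, `links` and `ids` are just placeholders, and pulling the number
out with basename() is my own suggestion, not something the site
requires.

```r
library(rvest)  # html_nodes(), html_attr(), read_html(), and the %>% pipe

# Trimmed inline copy of the markup quoted above (closing tags added).
html <- '<div class="survey-row" data-url="http://catalog.ihsn.org/index.php/catalog/7118" title="View study">
  <h2 class="title">
    <a href="http://catalog.ihsn.org/index.php/catalog/7118" title="Demographic and Health Survey 2015">
      Demographic and Health Survey 2015
    </a>
  </h2>
</div>'

page <- read_html(html)

# Second option: take the href of the <a> inside each entry title.
links <-
  page %>%
  html_nodes(".survey-row h2.title a") %>%
  html_attr("href")

# The catalog number is the last path segment of each link.
ids <- basename(links)
```

On the real page you would replace `page` with the parsed result of
the GET() call; `links` can then be fed back into GET() one by one to
reach the next level of the site.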

hth.
david