[R] Web scraping different levels of a website

Fri Jan 19 15:58:09 CET 2018

Hey Ilio,

I revisited the code I posted earlier and fixed a few things.
This should let you collect as many studies as you like, controlled by
the num_studies variable.

If you try the URL below in your browser you will see that it returns a
"simpler" version of the page you linked. To find it, hit F12 to open
Developer Tools, go to the Network tab and click on the first entry in
the list; in the right pane, under the Headers tab, you should see the
Request URL.

I'm not very knowledgeable about sessions, cookies and the like, so you
might run into further problems. If you do, try the steps above on your
side and copy-paste the URL you find there into the code below. I broke
the URL into smaller chunks for readability and because that makes it
easier to substitute some of the query parameters.
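
If substituting the query parameters by hand gets fiddly, one
alternative to gluing the string together (as the code further down
does) is to let httr build the query from a named list. This is only a
rough sketch: url_alt is a name I made up, and I am assuming the empty
parameters and the trailing "_=" value (which looks like a timestamp)
can be dropped, which I have not verified against the server.

```r
library("httr")

# same settings as in the script below
num_studies <- 42
year_from <- 1890
year_to <- 2017

# modify_url() assembles and escapes the query string from a named list.
# Assumption: the empty parameters (repo=, sid=, sk=, ...) and the "_="
# value are optional; this has not been checked against the server.
url_alt <- modify_url(
  "http://catalog.ihsn.org/index.php/catalog/search",
  query = list(
    view = "s",
    ps = num_studies,
    page = 1,
    from = year_from,
    to = year_to,
    sort_by = "nation"
  )
)

url_alt
```

If the server turns out to need the dropped parameters, they can be
added to the list as empty strings.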

# load libs
library("rvest")
library("httr")
library("glue")
library("magrittr")

# number of studies to pull from catalogue
num_studies <- 42
year_from <- 1890
year_to <- 2017

# build up the url
url <-
  glue(
    "http://catalog.ihsn.org/index.php/catalog/",
    "search?view=s&",
    "ps={num_studies}&",
    "page=1&repo=&repo_ref=&sid=&_r=&sk=&vk=&",
    "from={year_from}&",
    "to={year_to}&",
    "sort_order=&sort_by=nation&_=1516371984886")

# read in the html
x <-
  url %>%
  GET() %>%
  content()

# option 1 (div with class "survey-row" --> data-url attribute)
x %>%
  html_nodes(".survey-row") %>%
  html_attr("data-url")

# option 2 (study titles are <a> within <h2> elems)
# note that this gives you some more information, like the title ...
x %>%
  html_nodes("h2 a")
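
In case it helps, here is a rough sketch of how the nodes from option 2
could be turned into a table of title, link and study number. So that it
runs without hitting the website, it parses a trimmed copy of the html
snippet from my previous mail instead of the live page. The names
snippet, links and studies, and the regular expression for the number,
are my own and assume the links always end in /catalog/<number>.

```r
library("rvest")

# trimmed copy of the html from the previous mail (stands in for the live page)
snippet <- '
<div class="survey-row"
     data-url="http://catalog.ihsn.org/index.php/catalog/7118" title="View study">
  <h2 class="title">
    <a href="http://catalog.ihsn.org/index.php/catalog/7118"
       title="Demographic and Health Survey 2015">
      Demographic and Health Survey 2015
    </a>
  </h2>
</div>'

# select the <a> elements inside <h2>, as in option 2
links <- html_nodes(read_html(snippet), "h2 a")

# one row per study: visible title and forward link
studies <- data.frame(
  title = trimws(html_text(links)),
  url = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

# assumption: the study number is the trailing run of digits in the link
studies$id <- sub(".*/catalog/([0-9]+)$", "\\1", studies$url)

studies
```

On the real page you would replace read_html(snippet) with the x object
from above.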

greetings,
david

On 18 January 2018 at 12:58, David Jankoski <david.jankoski at hellotrip.nl> wrote:
>
> Hey Ilio,
>
> On the main website (the first link that you provided) if you
> right-click on the title of any entry and select Inspect Element from
> the menu, you will notice in the Developer Tools view that opens up
> that the corresponding html looks like this
>
> (example for the same link that you provided)
>
> <div class="survey-row"
> data-url="http://catalog.ihsn.org/index.php/catalog/7118" title="View
> study">
>     <div class="data-access-icon data-access-remote" title="Data
> available from external repository"></div>
>         <h2 class="title">
>             <a href="http://catalog.ihsn.org/index.php/catalog/7118"
> title="Demographic and Health Survey 2015">
>               Demographic and Health Survey 2015
>             </a>
>       </h2>
>
> Notice how the number you are after is contained within the
> "survey-row" div element, in the data-url attribute, or alternatively
> within the <a> elem, in the href attribute. It's up to you which
> one you want to grab, but the idea would be the same, i.e.
>
> 1. read in the html
> 2. select all list-elements by css / xpath
> 3. grab the fwd link
>
> Here is an example using the first option.
>
> url <- "http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk="
>
> x <-
>   url %>%
>   GET() %>%
>   content()
>
> x %>%
>   html_nodes(".survey-row") %>%
>   html_attr("data-url")
>
> hth.
> david

-- 

David Jankoski

Teerketelsteeg 1
1012TB Amsterdam
www.hellotrip.com