[R] Web scraping different levels of a website

Ilio Fornasero iliofornasero at hotmail.com
Mon Jan 22 09:27:30 CET 2018


Thanks again, David.

I am trying to figure out a way to convert the lists into a data.frame.

Any hint?

The usual ways (do.call, etc) do not seem to work...

Thanks

Ilio

________________________________
From: David Jankoski <david.jankoski at hellotrip.nl>
Sent: Friday, 19 January 2018 15:58
To: iliofornasero at hotmail.com; r-help at r-project.org
Subject: Re: [R] Web scraping different levels of a website

Hey Ilio,

I revisited the previous code I posted to you and fixed some things.
This should let you collect as many studies as you like, controlled by
the num_studies arg.

If you try the below url in your browser you can see that it returns a
"simpler" version of the link you posted. To get to this you need to
hit F12 to open Developer Tools --> go to Network tab and click on the
first entry in the list --> in the right pane you should see under the
Headers tab the Request URL.

I'm not very knowledgeable about sessions/cookies and whatnot, but you
might run into some further problems. In that case you could
try the above on your side and then copy-paste the url that you
find there into the code below. I broke the url into smaller chunks for
readability and because it's easier to substitute some query
parameters.
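
As an aside, the query parameters could also be kept in a named list and
handed to httr, which then assembles the query string itself. This is only a
sketch: url_alt is a made-up name, and it assumes the server does not need the
empty parameters (repo=, sid=, ...) or the trailing "_" timestamp, which I have
not verified.

```r
library("httr")

# same values as in the script below
num_studies <- 42
year_from <- 1890
year_to <- 2017

# modify_url() builds the query string from a named list
# (assumption: the empty parameters of the original url can be dropped)
url_alt <- modify_url(
  "http://catalog.ihsn.org/index.php/catalog/search",
  query = list(
    view    = "s",
    ps      = num_studies,
    page    = 1,
    from    = year_from,
    to      = year_to,
    sort_by = "nation"
  )
)

url_alt
```

If the results come back different from the glue version, the dropped
parameters are the first thing to put back.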

# load libs
library("rvest")
library("httr")
library("glue")
library("magrittr")

# number of studies to pull from catalogue
num_studies <- 42
year_from <- 1890
year_to <- 2017

# build up the url
url <-
  glue(
    "http://catalog.ihsn.org/index.php/catalog/",
    "search?view=s&",
    "ps={num_studies}&",
    "page=1&repo=&repo_ref=&sid=&_r=&sk=&vk=&",
    "from={year_from}&",
    "to={year_to}&",
    "sort_order=&sort_by=nation&_=1516371984886")

# read in the html
x <-
  url %>%
  GET() %>%
  content()

# option 1 (div with class "survey-row" --> data-url attribute)
x %>%
  html_nodes(".survey-row") %>%
  html_attr("data-url")

# option 2 (studies titles are <a> within <h2> elems)
# note that this gives you some more information like the title ...
x %>%
  html_nodes("h2 a")
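
Building on option 2, here is a rough sketch of how the title and link of each
study could be collected into a data.frame. To keep it runnable without hitting
the site it parses a small inline snippet modelled on the markup quoted further
down; x_demo, studies and the second (example.org) entry are made up for
illustration. On the real page you would use x from above instead of x_demo.

```r
library("rvest")

# hypothetical stand-in for the downloaded page (second entry is invented)
x_demo <- read_html('
<div class="survey-row" data-url="http://catalog.ihsn.org/index.php/catalog/7118">
  <h2 class="title">
    <a href="http://catalog.ihsn.org/index.php/catalog/7118">
      Demographic and Health Survey 2015
    </a>
  </h2>
</div>
<div class="survey-row" data-url="http://example.org/catalog/1">
  <h2 class="title">
    <a href="http://example.org/catalog/1">
      Made-up Survey
    </a>
  </h2>
</div>')

# select the <a> elements once, then pull text and href from the same nodes
links <- html_nodes(x_demo, "h2 a")

studies <- data.frame(
  title = html_text(links, trim = TRUE),
  url   = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

# the study number is the last path component of the link
studies$id <- basename(studies$url)

studies
```

Because both columns come from the same node set they have the same length, so
data.frame() can be called directly without do.call.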


greetings,
david

On 18 January 2018 at 12:58, David Jankoski <david.jankoski at hellotrip.nl> wrote:
>
> Hey Ilio,
>
> On the main website (the first link that you provided) if you
> right-click on the title of any entry and select Inspect Element from
> the menu, you will notice in the Developer Tools view that opens up
> that the corresponding html looks like this
>
> (example for the same link that you provided)
>
> <div class="survey-row"
> data-url="http://catalog.ihsn.org/index.php/catalog/7118" title="View
> study">
>     <div class="data-access-icon data-access-remote" title="Data
> available from external repository"></div>
>         <h2 class="title">
>             <a href="http://catalog.ihsn.org/index.php/catalog/7118"
> title="Demographic and Health Survey 2015">
>               Demographic and Health Survey 2015
>             </a>
>       </h2>
>
> Notice how the number you are after is contained within the
> "survey-row" div element, in the data-url attribute. Or alternatively
> within the <a> elem, in the href attribute. It's up to you which
> one you want to grab but the idea would be the same i.e.
>
> 1. read in the html
> 2. select all list-elements by css / xpath
> 3. grab the fwd link
>
> Here is an example using the first option.
>
> url <- "http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk="
>
> x <-
>   url %>%
>   GET() %>%
>   content()
>
> x %>%
>   html_nodes(".survey-row") %>%
>   html_attr("data-url")
>
> hth.
> david




--

David Jankoski

Teerketelsteeg 1
1012TB Amsterdam
www.hellotrip.com
