[R] Download CSV Files from EUROSTAT Website

Paul Bivand paul.bivand at gmail.com
Wed Nov 6 00:31:27 CET 2013


This looks as though you need to be a little XML old-school.
readHTMLTable is a summary function drawing on:

?htmlTreeParse() turns the table into xml
?xpathApply()
and more.

# xpathApply(doc, "//td", xmlValue) visits each table cell (<td>)
and extracts its value

# The "//th" picks out the table headings without distinction as to
whether they are rows or columns

Followed by various gsub() calls and turning the result into a matrix
(it comes out as a flat list of values with no column structure). I
couldn't identify the headings, but the table body is definitely doable.
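
As a rough illustration of the steps above, here is a minimal sketch run on
a small made-up HTML table rather than the real Eurostat page, whose markup
is not reproduced here. The table contents and the variable names (html,
cells, body, values) are assumptions for the example only; htmlParse(),
getNodeSet(), xpathSApply() and xmlValue() come from the XML package.

```r
library(XML)

# Made-up stand-in for the page: one heading row, two data rows.
html <- paste0(
  "<html><body><table>",
  "<tr><th>geo</th><th>2011</th><th>2012</th></tr>",
  "<tr><td>Italy</td><td>1,580.4(p)</td><td>1,567.0</td></tr>",
  "<tr><td>Spain</td><td>1,046.3</td><td>1,029.3(p)</td></tr>",
  "</table></body></html>"
)

doc <- htmlParse(html, asText = TRUE)

# Every <td> value, in document order, as one flat character vector.
cells <- xpathSApply(doc, "//td", xmlValue)

# Drop flags such as "(p)" and the thousands separators.
cells <- gsub("\\([^)]*\\)", "", cells)
cells <- gsub(",", "", cells)

# One matrix row per <tr> that contains data cells.
n_rows <- length(getNodeSet(doc, "//tr[td]"))
body <- matrix(cells, nrow = n_rows, byrow = TRUE)

# First column holds the row labels, the remaining columns are numbers.
values <- matrix(as.numeric(body[, -1]), nrow = n_rows,
                 dimnames = list(body[, 1], NULL))
print(values)
```

Counting the data rows with "//tr[td]" avoids hard-coding the number of
columns, but it only works if every data row has the same number of cells.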

readHTMLTable seems to assume that the column headings are a single
row, which isn't always the case.
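
When the headings span more than one row, one possible workaround is to read
each heading row separately and paste the rows together column by column.
The sketch below assumes a made-up table in which every heading row has one
<th> per column (no colspan or rowspan), which may well not hold for the
Eurostat page; the " | " separator and the variable names are arbitrary.

```r
library(XML)

# Made-up table with two heading rows of equal length.
html <- paste0(
  "<html><body><table>",
  "<tr><th>geo</th><th>GDP</th><th>GDP</th></tr>",
  "<tr><th>country</th><th>2011</th><th>2012</th></tr>",
  "<tr><td>Italy</td><td>1580.4</td><td>1567.0</td></tr>",
  "</table></body></html>"
)

doc <- htmlParse(html, asText = TRUE)

# One character vector of <th> values per heading row.
header_rows <- lapply(getNodeSet(doc, "//tr[th]"),
                      function(tr) xpathSApply(tr, "./th", xmlValue))

# Combine the heading rows column by column into a single name per column.
col_names <- do.call(paste, c(header_rows, sep = " | "))
print(col_names)
```

With spanning cells the heading rows would have different lengths, and the
spans would need to be expanded before pasting.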

Paul Bivand


On 5 November 2013 18:44, Barry Rowlingson <b.rowlingson at lancaster.ac.uk> wrote:
> On 4 Nov 2013 19:30, "David Winsemius" <dwinsemius at comcast.net> wrote:
>
>> Maybe you should use their "download" facility rather than trying to
>> deparse a complex webpage with lots of special user interaction "features":
>>
>> http://appsso.eurostat.ec.europa.eu/nui/setupDownloads.do
>>
>
> That web page depends on the user already having been to the previous page
> to set up a session and so directly downloading a dataset requires setting
> up cookies and making sure the request has all the right parameters. Looks
> like a right pain.
>
> --
>> David.
>> >
>>
>> On Nov 4, 2013, at 11:03 AM, Lorenzo Isella wrote:
>>
>> > Thanks.
>> > I had already introduced these minor adjustments in the code, but the
>> > real problem (to me) is the information that gets lost: the informative
>> > names of the columns, the indicator type and the units.
>>
>> > Cheers
>> >
>> > Lorenzo
>> >
>> > On Mon, 04 Nov 2013 19:52:51 +0100, Rui Barradas <ruipbarradas at sapo.pt> wrote:
>> >
>> >> Hello,
>> >>
>> >> If you want to get rid of the (bp) stuff, you can use lapply/gsub.
>> >> Using Jean's code, slightly changed:
>> >>
>> >> library(XML)
>> >>
>> >> mylines <- readLines(url("http://bit.ly/1coCohq"))
>> >> closeAllConnections()
>> >> mytable <- readHTMLTable(mylines, which = 2, asText = TRUE,
>> >>                          stringsAsFactors = FALSE)
>> >>
>> >> str(mytable)
>> >>
>> >> mytable[] <- lapply(mytable, function(x) gsub("\\(.*\\)", "", x))
>> >> mytable[] <- lapply(mytable, function(x) gsub(",", "", x))
>> >> mytable[] <- lapply(mytable, as.numeric)
>> >>
>> >> colnames(mytable) <- 2000:2013
>> >>
>> >>
>> >> Hope this helps,
>> >>
>> >> Rui Barradas
>> >>
>> >> Em 04-11-2013 09:53, Lorenzo Isella escreveu:
>> >>> Hello,
>> >>> And thanks a lot.
>> >>> This is indeed very close to what I need.
>> >>> I am trying to figure out how not to "lose" the headers and how to avoid
>> >>> downloading labels like "(p)" together with the numerical data I am
>> >>> interested in.
>> >>> If anyone on the list knows how to make these minor modifications, s/he
>> >>> will make my life much easier.
>> >>> Cheers
>> >>>
>> >>> Lorenzo
>> >>>
>> >>>
>> >>> On Fri, 01 Nov 2013 14:25:49 +0100, Adams, Jean <jvadams at usgs.gov> wrote:
>> >>>
>> >>>> Lorenzo,
>> >>>>
>> >>>> I may be able to help you get started.  You can use the XML package to
>> >>>> grab the information off the internet.
>> >>>>
>> >>>> library(XML)
>> >>>>
>> >>>> mylines <- readLines(url("http://bit.ly/1coCohq"))
>> >>>> closeAllConnections()
>> >>>> mylist <- readHTMLTable(mylines, asText=TRUE)
>> >>>> mytable <- mylist$xTable
>> >>>>
>> >>>> However, when I look at the resulting object, mytable, it doesn't have
>> >>>> informative row or column headings.  Perhaps someone else can figure
>> >>>> out how to get that information.
>> >>>>
>> >>>> Jean
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Thu, Oct 31, 2013 at 10:38 AM, Lorenzo Isella
>> >>>> <lorenzo.isella at gmail.com> wrote:
>> >>>>> Dear All,
>> >>>>> I often need to do some work on some data which is publicly available
>> >>>>> on the EUROSTAT website.
>> >>>>> I saw several ways to automatically download mainly the bulk data
>> >>>>> from EUROSTAT and postprocess it later with R, for instance
>> >>>>>
>> >>>>> http://bit.ly/HrDICj
>> >>>>> http://bit.ly/HrDL10
>> >>>>> http://bit.ly/HrDTgT
>> >>>>>
>> >>>>> However, what I would like to do is to be able to download directly
>> >>>>> the csv file corresponding to a properly formatted dataset
>> >>>>> (typically a dynamic dataset) from EUROSTAT.
>> >>>>> To fix the ideas, please consider the dataset at the following link
>> >>>>>
>> >>>>> http://bit.ly/1coCohq
>> >>>>>
>> >>>>> what I would like to do is to automatically read its content into R,
>> >>>>> or at least to automatically download it as a csv file (full
>> >>>>> extraction, single file, no flags and footnotes) which I can then
>> >>>>> manipulate easily.
>> >>>>> Any suggestion is appreciated.
>> >>>>> Cheers
>> >>>>>
>> >>>>> Lorenzo
>> >>>>>
>> >>>>> ______________________________________________
>> >>>>> R-help at r-project.org mailing list
>> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>>> PLEASE do read the posting guide
>> >>>>> http://www.R-project.org/posting-guide.html
>> >>>>> and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>> David Winsemius
>> Alameda, CA, USA
>>
>


